zhuqi-lucas commented on issue #21399:
URL: https://github.com/apache/datafusion/issues/21399#issuecomment-4195977596
Thanks @adriangb, you're right that RG reordering doesn't need new arrow-rs
APIs: we can reorder the ParquetAccessPlan before building the decoder, just
like you described. That's the right approach for #21317.
The new arrow-rs APIs (peek/skip) serve a different purpose: dynamically
skipping row groups during execution based on the TopK threshold. The access
plan is fixed before the decoder starts, so once it begins reading it processes
all selected RGs in order. But after reading the first RG, TopK sets a tight
threshold (e.g., id > 999991), and the remaining 19 RGs can be skipped
entirely, since none of them can contain a qualifying row. Without peek/skip,
the decoder has no way to stop mid-file.
These are complementary:
1. Reorder (your suggestion, no arrow-rs change): put the best RGs first
   so TopK gets a tight threshold quickly.
2. Dynamic skip (arrow-rs peek/skip): after the threshold is set, skip the
   remaining RGs that can't contain qualifying rows, with no I/O and no decode.
I verified this locally: with dynamic RG pruning on a 20-RG file, ORDER BY
id DESC LIMIT 10 reads only 1 RG instead of 20 (19 skipped, 95% I/O saved, 4.5x
faster).
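
To illustrate how reordering and dynamic skipping interact, here is a minimal, self-contained simulation. It is only a sketch and does not use the real arrow-rs or DataFusion APIs: `RowGroup`, `ScanResult`, `reorder_for_desc`, `top_k_desc`, and `example_row_groups` are hypothetical names, and the layout (20 row groups of 50,000 rows each, ids 1..=1,000,000 in ascending file order) is an assumption chosen to match the numbers above, not the file used in the local test.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for one Parquet row group: min/max statistics plus rows.
#[derive(Debug, Clone)]
pub struct RowGroup {
    pub min_id: i64, // not needed for DESC; an ASC scan would compare against this
    pub max_id: i64,
    pub ids: Vec<i64>,
}

/// Outcome of the simulated scan.
#[derive(Debug, PartialEq)]
pub struct ScanResult {
    pub top_k: Vec<i64>, // sorted descending
    pub row_groups_read: usize,
    pub row_groups_skipped: usize,
}

/// Step 1 (reorder): put row groups with the largest `max_id` first,
/// so an ORDER BY id DESC scan sees the best candidates early.
pub fn reorder_for_desc(row_groups: &mut [RowGroup]) {
    row_groups.sort_by_key(|rg| Reverse(rg.max_id));
}

/// Step 2 (dynamic skip): scan row groups in the given order and skip any
/// whose `max_id` cannot beat the current TopK threshold.
pub fn top_k_desc(row_groups: &[RowGroup], k: usize) -> ScanResult {
    if k == 0 {
        return ScanResult {
            top_k: Vec::new(),
            row_groups_read: 0,
            row_groups_skipped: row_groups.len(),
        };
    }
    // Min-heap of the k largest ids seen so far; its top is the threshold.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k);
    let (mut read, mut skipped) = (0usize, 0usize);

    for rg in row_groups {
        // A threshold only exists once the heap holds k rows.
        let threshold = if heap.len() == k {
            heap.peek().map(|r| r.0)
        } else {
            None
        };
        if let Some(t) = threshold {
            if rg.max_id <= t {
                skipped += 1; // no row here satisfies id > t: no I/O, no decode
                continue;
            }
        }
        read += 1;
        for &id in &rg.ids {
            if heap.len() < k {
                heap.push(Reverse(id));
                continue;
            }
            let current_min = heap.peek().map(|r| r.0);
            if current_min.map_or(false, |min| id > min) {
                heap.pop();
                heap.push(Reverse(id));
            }
        }
    }

    let mut top_k: Vec<i64> = heap.into_iter().map(|r| r.0).collect();
    top_k.sort_unstable_by(|a, b| b.cmp(a));
    ScanResult { top_k, row_groups_read: read, row_groups_skipped: skipped }
}

/// Assumed example layout: 20 row groups, ids 1..=1_000_000, ascending file order.
pub fn example_row_groups() -> Vec<RowGroup> {
    (0..20i64)
        .map(|i| {
            let min_id = i * 50_000 + 1;
            let max_id = (i + 1) * 50_000;
            RowGroup { min_id, max_id, ids: (min_id..=max_id).collect() }
        })
        .collect()
}
```

In this toy model, calling `top_k_desc(&rgs, 10)` on the file-order layout reads all 20 row groups, because each one raises the threshold a little. Calling `reorder_for_desc(&mut rgs)` first lets the initial row group set the threshold at 999991, so the other 19 are skipped. Neither step alone gives that result, which is why the two are complementary.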