Re: [I] [Epic] Dynamic row group pruning using TopK threshold during parquet scan [datafusion]

via GitHub Mon, 06 Apr 2026 18:49:30 -0700


zhuqi-lucas commented on issue #21399:
URL: https://github.com/apache/datafusion/issues/21399#issuecomment-4195977596


   Thanks @adriangb, you're right that row group (RG) reordering doesn't need 
new arrow-rs APIs: we can reorder the ParquetAccessPlan before building the 
decoder, just as you described. That's the right approach for #21317.
   
   The new arrow-rs APIs (peek/skip) serve a different purpose: dynamically 
skipping row groups during execution based on the TopK threshold. The access 
plan is fixed before the decoder starts, so once decoding begins it processes 
all selected RGs in order. But after the first RG has been read, TopK sets a 
tight threshold (e.g., id > 999991), and in a 20-RG file the remaining 19 RGs 
can then be skipped entirely. Without peek/skip, the decoder has no way to 
stop mid-file.
   
   These are complementary:
     1. Reorder (your suggestion, no arrow-rs change): put the best RGs first 
so TopK gets a tight threshold quickly
     2. Dynamic skip (arrow-rs peek/skip): after threshold is set, skip 
remaining RGs that can't contain qualifying rows — no I/O, no decode
   
   I verified this locally: with dynamic RG pruning on a 20-RG file, ORDER BY 
id DESC LIMIT 10 reads only 1 RG instead of 20 (19 skipped, 95% of I/O saved, 
4.5x faster).
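
   As a rough illustration of how the two steps interact, here is a toy model 
in plain Rust. It is not the DataFusion or arrow-rs code and uses neither the 
proposed peek/skip APIs nor ParquetAccessPlan. `RowGroup`, `make_file`, and 
`top_k_desc` are hypothetical names, and the layout (20 row groups of 50,000 
consecutive ids starting at 1) is an assumption chosen to match the example 
threshold above. It models only which row groups get read, not timing.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a Parquet row group: max statistic plus its values.
struct RowGroup {
    max: i64,
    values: Vec<i64>,
}

/// Build `n` row groups of `rows` consecutive ids each, starting at id 1.
/// (Assumed layout for illustration; real files need not be sorted.)
fn make_file(n: i64, rows: i64) -> Vec<RowGroup> {
    (0..n)
        .map(|i| {
            let min = i * rows + 1;
            let max = (i + 1) * rows;
            RowGroup { max, values: (min..=max).collect() }
        })
        .collect()
}

/// Simulate `ORDER BY id DESC LIMIT k`.
/// Returns the top-k ids (descending) and the number of row groups read.
fn top_k_desc(
    mut rgs: Vec<RowGroup>,
    k: usize,
    reorder: bool,
    dynamic_skip: bool,
) -> (Vec<i64>, usize) {
    if reorder {
        // Step 1 (reorder): visit row groups with the largest max first.
        rgs.sort_by_key(|rg| Reverse(rg.max));
    }

    // Min-heap holding the current top k; its smallest element is the threshold.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    let mut read = 0;

    for rg in &rgs {
        // Step 2 (dynamic skip): once k rows are held, a row group whose max
        // is not above the threshold cannot contribute, so it is never read.
        let threshold = if heap.len() == k { heap.peek().map(|r| r.0) } else { None };
        if dynamic_skip && threshold.map_or(false, |t| rg.max <= t) {
            continue;
        }

        read += 1;
        for &v in &rg.values {
            if heap.len() < k {
                heap.push(Reverse(v));
            } else if heap.peek().map_or(false, |r| v > r.0) {
                heap.pop();
                heap.push(Reverse(v));
            }
        }
    }

    let mut top: Vec<i64> = heap.into_iter().map(|r| r.0).collect();
    top.sort_unstable_by(|a, b| b.cmp(a));
    (top, read)
}
```

   Under these assumptions, `top_k_desc(make_file(20, 50_000), 10, true, true)` 
reads 1 of the 20 row groups. With `reorder` set to false the file is visited 
in ascending id order, every row group raises the threshold, and nothing is 
skipped; with `dynamic_skip` set to false all 20 are read regardless of order. 
That is why the two steps are complementary rather than alternatives.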


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

