zhuqi-lucas commented on issue #21399:
URL: https://github.com/apache/datafusion/issues/21399#issuecomment-4195997989

   > > The new arrow-rs APIs (peek/skip) are for a different purpose: 
dynamically skipping row groups during execution based on the TopK threshold. 
The access plan is fixed before the decoder starts — once it begins reading, it 
processes all selected RGs in order. But after reading the first RG, TopK sets 
a tight threshold (e.g., id > 999991), and the remaining 19 RGs can be skipped 
entirely. Without peek/skip, the decoder has no way to stop mid-file.
   > 
   > I think this will be sorted out by the morsels API: we can do additional 
data skipping before each morsel (still some overhead but nowhere near as much 
as reading the row group)
   
   
   Ok, if i make sense right,  with morsels providing natural checkpoints 
between row groups, we can check the dynamic filter before scheduling each 
morsel — no arrow-rs changes needed? If so, much cleaner than what I prototyped 
(which hacked a boundary signal into the push decoder).
   
   Could you point me to the morsels API design doc or tracking issue? I'd like 
to understand the execution model to plan how dynamic RG skipping would 
integrate, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to