tustvold commented on issue #2270: URL: https://github.com/apache/arrow-rs/issues/2270#issuecomment-1201811291
I need to think more on this, but some immediate thoughts that may or may not make sense:

* This system will only be able to push down to eliminate decode overheads, i.e. it will be unable to eliminate the IO needed to fetch data (which is fine, as we have the page index for that)
* I wonder if it would be simpler to push the predicate into `ParquetRecordBatchReader`; that way you don't need to futz around with async, and it would also potentially allow predicate evaluation on encoded data eventually (a rough sketch of this idea follows the list)
* We probably need a way to represent predicates within parquet directly, so that all the various pruning and skipping logic can be centralised rather than spread across two repos
* The nature of parquet is such that skipping runs of rows shorter than the normal batch_size may in fact be slower than just reading them normally. This means that if we don't determine the ranges up front, we'll need some way to bail out if it gets too expensive
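To make the second point concrete, here is a minimal sketch of what pushing a predicate into the reader might look like. This is purely illustrative: the `ArrowPredicate` trait and `FilteredReader` wrapper are hypothetical names, not existing arrow-rs APIs, and for simplicity the sketch applies the predicate to batches *after* they are decoded. A real implementation would feed the resulting selection back into the column readers so that non-matching rows are never decoded at all.

```rust
use arrow::array::BooleanArray;
use arrow::compute::filter_record_batch;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Hypothetical predicate: evaluates a batch to a boolean mask
/// with one entry per row.
trait ArrowPredicate {
    fn evaluate(&mut self, batch: &RecordBatch) -> Result<BooleanArray, ArrowError>;
}

/// Hypothetical wrapper that applies a predicate to each batch
/// produced by an inner reader, yielding only the matching rows.
struct FilteredReader<R, P> {
    inner: R,
    predicate: P,
}

impl<R, P> Iterator for FilteredReader<R, P>
where
    R: Iterator<Item = Result<RecordBatch, ArrowError>>,
    P: ArrowPredicate,
{
    type Item = Result<RecordBatch, ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        // Decode the next batch from the inner reader
        let batch = match self.inner.next()? {
            Ok(b) => b,
            Err(e) => return Some(Err(e)),
        };
        // Evaluate the predicate to get a row-level selection mask
        let mask = match self.predicate.evaluate(&batch) {
            Ok(m) => m,
            Err(e) => return Some(Err(e)),
        };
        // Retain only the selected rows
        Some(filter_record_batch(&batch, &mask))
    }
}
```

Filtering decoded batches like this only saves downstream work; the decode-elimination win the first bullet refers to would come from turning the mask into row ranges the decoder can skip, which is also where the "short skip runs may be slower" caveat in the last bullet bites.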