tustvold commented on issue #2270: URL: https://github.com/apache/arrow-rs/issues/2270#issuecomment-1201811291
I need to think more on this, but some immediate thoughts that may or may not make sense:

* This system will only be able to push down to eliminate decode overheads, i.e. it will be unable to eliminate the IO needed to fetch data (which is fine, as we have the page index for that)
* I wonder if it would be simpler to push the predicate into `ParquetRecordBatchReader`; that way you don't need to futz around with async, and it would also potentially allow predicate evaluation on encoded data eventually (a rough sketch of this idea follows the list)
* We probably need a way to represent predicates within parquet directly, so that all the various pruning and skipping logic can be centralised rather than spread across two repos
* The nature of parquet is such that skipping runs of rows shorter than the normal batch_size may in fact be slower than just reading them normally. This means that if we don't determine the ranges up front, we'll need some way to bail out if it gets too expensive
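To make the second point concrete, here is a minimal sketch of what pushing a predicate into the reader might look like. This is purely illustrative: the `ArrowPredicate` trait and `FilteredReader` wrapper are hypothetical names, not existing arrow-rs APIs, and for simplicity the sketch applies the predicate to batches *after* they are decoded. A real implementation would feed the resulting selection back into the column readers so that non-matching rows are never decoded at all.

```rust
use arrow::array::BooleanArray;
use arrow::compute::filter_record_batch;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Hypothetical predicate: evaluates a batch to a boolean mask
/// with one entry per row.
trait ArrowPredicate {
    fn evaluate(&mut self, batch: &RecordBatch) -> Result<BooleanArray, ArrowError>;
}

/// Hypothetical wrapper that applies a predicate to each batch
/// produced by an inner reader, yielding only the matching rows.
struct FilteredReader<R, P> {
    inner: R,
    predicate: P,
}

impl<R, P> Iterator for FilteredReader<R, P>
where
    R: Iterator<Item = Result<RecordBatch, ArrowError>>,
    P: ArrowPredicate,
{
    type Item = Result<RecordBatch, ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        // Decode the next batch from the inner reader
        let batch = match self.inner.next()? {
            Ok(b) => b,
            Err(e) => return Some(Err(e)),
        };
        // Evaluate the predicate to get a row-level selection mask
        let mask = match self.predicate.evaluate(&batch) {
            Ok(m) => m,
            Err(e) => return Some(Err(e)),
        };
        // Retain only the selected rows
        Some(filter_record_batch(&batch, &mask))
    }
}
```

Filtering decoded batches like this only saves downstream work; the decode-elimination win the first bullet refers to would come from turning the mask into row ranges the decoder can skip, which is also where the "short skip runs may be slower" caveat in the last bullet bites.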