alamb commented on PR #10738: URL: https://github.com/apache/datafusion/pull/10738#issuecomment-2163999195
> @alamb is there any documentation on what it means for DataFusion to "scan" specific rows within a row group? Does it actually read only those rows? I'd imagine that because of some mix of compression and limitations of byte range fetches to contiguous bytes for object stores you end up streaming entire row groups anyway.

Specifically, DataFusion uses this API: https://github.com/apache/arrow-rs/blob/0cc14168000e1e41fc5f63929d34d13dda6e5873/parquet/src/arrow/arrow_reader/mod.rs#L137-L194

If the file has the PageIndex (which the parquet-rs writer writes by default), the reader may be able to skip certain pages rather than read the entire row group.
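
To make the page-skipping idea concrete, here is a minimal sketch in plain Rust of how a row selection could be mapped onto pages. It is a toy model under stated assumptions, not the arrow-rs implementation: the `Run` type and the `selected_ranges` and `pages_to_fetch` functions are hypothetical names invented for illustration, and the only input assumed from a page index is the first row number of each page. Since Parquet compresses and encodes data per page, a page that contains even one selected row still has to be fetched and decoded as a whole, while pages with no selected rows can be skipped.

```rust
use std::ops::Range;

/// One run of a row selection: skip or select `row_count` consecutive rows.
/// (Hypothetical type for this sketch; not the arrow-rs type.)
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Run {
    pub row_count: usize,
    pub skip: bool,
}

/// Convert a sequence of skip/select runs into the absolute row ranges
/// that are selected within the row group.
pub fn selected_ranges(runs: &[Run]) -> Vec<Range<usize>> {
    let mut out = Vec::new();
    let mut pos = 0;
    for run in runs {
        let end = pos + run.row_count;
        if !run.skip && run.row_count > 0 {
            out.push(pos..end);
        }
        pos = end;
    }
    out
}

/// Given the first row index of each page (the kind of information a page
/// index provides) and the total number of rows in the row group, return the
/// indices of the pages that overlap at least one selected row.
pub fn pages_to_fetch(
    page_first_rows: &[usize],
    total_rows: usize,
    runs: &[Run],
) -> Vec<usize> {
    let selected = selected_ranges(runs);
    let mut pages = Vec::new();
    for (i, &start) in page_first_rows.iter().enumerate() {
        // A page ends where the next page begins, or at the end of the row group.
        let end = page_first_rows.get(i + 1).copied().unwrap_or(total_rows);
        // Half-open ranges overlap when each one starts before the other ends.
        if selected.iter().any(|r| r.start < end && start < r.end) {
            pages.push(i);
        }
    }
    pages
}
```

For example, with four pages of 100 rows each and a selection covering rows 150..170 and 350..400, this model touches only the second and fourth pages. Whether skipping those pages also saves I/O on an object store depends on how the reader batches byte-range requests, which this sketch does not model.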