thinkharderdev commented on PR #10738: URL: https://github.com/apache/datafusion/pull/10738#issuecomment-2165588841
> > @alamb is there any documentation on what it means for DataFusion to "scan" specific rows within a row group? Does it actually read only those rows? I'd imagine that because of some mix of compression and limitations of byte range fetches to contiguous bytes for object stores you end up streaming entire row groups anyway.
>
> Specifically, DataFusion uses this API: https://github.com/apache/arrow-rs/blob/0cc14168000e1e41fc5f63929d34d13dda6e5873/parquet/src/arrow/arrow_reader/mod.rs#L137-L194
>
> Which if you have the PageIndex (which is written by default in the parquet rs writer) the reader may be able to skip certain pages

Yeah, so conceptually how it works is that once we have a `RowSelection` we can do two things:

1. If there is a `PageIndex`, we can compare the `RowSelection` to the `PageIndex` and fetch only the data pages which contain selected rows (and hence prune IO).
2. While decoding the data pages that were fetched, we can skip decoding of rows that were not selected. Depending on the exact datatype this can be more or less useful. For something that is delta encoded, you can't really skip decoding within mini-blocks, so it probably doesn't make a huge difference. With a fixed-size datatype, you can skip over an arbitrary number of rows by jumping directly to the next selected row, potentially saving a bunch of CPU cycles.
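To make the two steps concrete, here is a minimal, std-only sketch. It is not the arrow-rs implementation: `PageLocation`, `pages_to_fetch`, and `decode_selected_i32` are hypothetical names invented for illustration, the row selection is modeled as a plain list of row ranges rather than the real `RowSelection` type, and pages are assumed to hold uncompressed, plain-encoded little-endian `i32` values.

```rust
use std::ops::Range;

/// Hypothetical stand-in for one page index entry: the rows a data page covers.
#[derive(Debug, Clone, PartialEq)]
struct PageLocation {
    first_row: usize,
    num_rows: usize,
}

/// Step 1 (IO pruning): return the indices of pages that contain at least one
/// selected row. Pages not listed here never need to be fetched.
fn pages_to_fetch(pages: &[PageLocation], selected: &[Range<usize>]) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|(_, p)| {
            let page_end = p.first_row + p.num_rows;
            // Two half-open ranges overlap when each starts before the other ends.
            selected
                .iter()
                .any(|s| !s.is_empty() && s.start < page_end && p.first_row < s.end)
        })
        .map(|(i, _)| i)
        .collect()
}

/// Step 2 (decode skipping): for a fixed-width type, the byte offset of any row
/// is known, so we can jump straight to each selected row without touching the
/// rows in between. Assumes `page` is plain little-endian i32 data.
fn decode_selected_i32(
    page: &[u8],
    page_first_row: usize,
    selected: &[Range<usize>],
) -> Vec<i32> {
    let page_end = page_first_row + page.len() / 4;
    let mut out = Vec::new();
    for s in selected {
        // Clamp the selected range to the rows actually stored in this page.
        let start = s.start.max(page_first_row);
        let end = s.end.min(page_end);
        for row in start..end {
            let off = (row - page_first_row) * 4;
            out.push(i32::from_le_bytes([
                page[off],
                page[off + 1],
                page[off + 2],
                page[off + 3],
            ]));
        }
    }
    out
}
```

With three 100-row pages and a selection covering rows 10..20 and 250..260, `pages_to_fetch` would return only the first and third pages, so the middle page is never read. The second function only works this way because every value has the same width; a delta-encoded page would have to decode whole mini-blocks, which is the limitation mentioned above.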