Re: [PR] Add `ParquetAccessPlan`, unify RowGroup selection and PagePruning selection [datafusion]

via GitHub Thu, 13 Jun 2024 05:56:29 -0700


thinkharderdev commented on PR #10738:
URL: https://github.com/apache/datafusion/pull/10738#issuecomment-2165588841


   > > @alamb is there any documentation on what it means for DataFusion to 
"scan" specific rows within a row group? Does it actually read only those rows? 
I'd imagine that because of some mix of compression and limitations of byte 
range fetches to contiguous bytes for object stores you end up streaming entire 
row groups anyway.
   > 
   > Specifically, DataFusion uses this API: 
https://github.com/apache/arrow-rs/blob/0cc14168000e1e41fc5f63929d34d13dda6e5873/parquet/src/arrow/arrow_reader/mod.rs#L137-L194
   > 
   > Which if you have the PageIndex (which is written by default in the 
parquet rs writer) the reader may be able to skip certain pages
   
   Yeah, so conceptually, once we have a `RowSelection`, two things can 
happen:
   1. If there is a `PageIndex`, we can compare the `RowSelection` to the 
`PageIndex` and fetch only the data pages that contain selected rows (and 
hence prune IO).
   2. While decoding the data pages that were fetched, we can skip decoding 
rows that were not selected. How much this helps depends on the exact datatype 
and encoding. For something that is delta encoded, you can't really skip 
decoding within mini-blocks, so it probably doesn't make a huge difference. 
With a fixed-size datatype, though, you can skip over an arbitrary number of 
rows by jumping directly to the next selected row, which can save a bunch of 
CPU cycles.
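
To make the two steps concrete, here is a rough, std-only sketch. It is not the arrow-rs implementation: `Run`, `pages_to_fetch`, and `decode_selected_i32` are hypothetical names made up for illustration, the page index is reduced to a list of first-row offsets per page, and the "page" in step 2 is assumed to be uncompressed, plain little-endian `i32` values with no nulls.

```rust
/// Hypothetical stand-in for one run of a row selection:
/// `row_count` consecutive rows that are either skipped or selected.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Run {
    pub row_count: usize,
    pub skip: bool,
}

/// Step 1 (IO pruning): return the indices of pages that contain at least
/// one selected row. `page_first_rows[i]` is the first row of page `i`
/// (ascending, starting at 0); `total_rows` is the row count of the row group.
pub fn pages_to_fetch(
    selection: &[Run],
    page_first_rows: &[usize],
    total_rows: usize,
) -> Vec<usize> {
    let mut pages = Vec::new();
    let mut row = 0usize;
    for run in selection {
        let start = row;
        let end = (row + run.row_count).min(total_rows);
        row = end;
        if run.skip || start >= end {
            continue;
        }
        for (i, &first) in page_first_rows.iter().enumerate() {
            // A page covers rows [first, page_end).
            let page_end = page_first_rows.get(i + 1).copied().unwrap_or(total_rows);
            let overlaps = first < end && start < page_end;
            // Runs are visited in row order, so comparing against the last
            // pushed page is enough to avoid duplicates.
            if overlaps && pages.last() != Some(&i) {
                pages.push(i);
            }
        }
    }
    pages
}

/// Step 2 (CPU pruning): decode only the selected values from a buffer of
/// fixed-width values. Skipped runs cost a single offset bump because every
/// value is exactly 4 bytes wide.
pub fn decode_selected_i32(page: &[u8], selection: &[Run]) -> Vec<i32> {
    const WIDTH: usize = 4;
    let mut out = Vec::new();
    let mut offset = 0usize;
    for run in selection {
        let run_bytes = run.row_count * WIDTH;
        if !run.skip {
            let start = offset.min(page.len());
            let end = (offset + run_bytes).min(page.len());
            for chunk in page[start..end].chunks_exact(WIDTH) {
                out.push(i32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]));
            }
        }
        offset += run_bytes;
    }
    out
}
```

For example, with pages starting at rows 0, 100, 200, and 300 and a selection of "skip 100, select 50, skip 150, select 10", only the second and fourth pages would need to be fetched. The offset-bump trick in step 2 is exactly what a delta encoding rules out, since each value depends on the ones decoded before it.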


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

