Re: [I] `parquet::column::reader::GenericColumnReader::skip_records` still decompresses most data [arrow-rs]

via GitHub Wed, 25 Sep 2024 10:49:54 -0700


etseidl commented on issue #6454:
URL: https://github.com/apache/arrow-rs/issues/6454#issuecomment-2374680541


   > Have you enabled the page index?
   
   Indeed. Or enabled v2 page headers? The issue seems to be that when skipping 
rows (`skip_records` defines a record as rep_level == 0, so a row), the number 
of rows per page is not known in advance, so to figure out the number of levels 
to skip, the repetition levels need to be decoded for every page. For V1 pages, 
unfortunately, the level information is compressed along with the page data, so 
the entire page needs decompressing to calculate the number of rows. If either 
V2 page headers or the page index were enabled, then the number of rows per 
page is known without having to do decompression, so entire pages can be 
skipped with very little effort (the continue at L330 above).
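   
   A minimal sketch of that decision, as a toy model only: `Page`, `SkipStats`, 
`count_records` and this `skip_records` are hypothetical stand-ins, not the 
parquet crate's types or its actual implementation. It assumes a page either 
carries a known row count (V2 header or page index) or has to be decompressed 
so its repetition levels can be counted.

```rust
/// Hypothetical, simplified page description (not a parquet crate type).
#[derive(Debug, Clone)]
struct Page {
    /// Row count known up front (V2 header or page index); `None` models a
    /// V1 page with no page index.
    num_rows: Option<usize>,
    /// Repetition levels; for a V1 page these are only reachable after the
    /// page has been decompressed.
    rep_levels: Vec<u8>,
}

/// Counters that make the cost of each path visible.
#[derive(Debug, Default, PartialEq)]
struct SkipStats {
    rows_skipped: usize,
    pages_decompressed: usize,
}

/// A record (row) starts wherever the repetition level is 0.
fn count_records(rep_levels: &[u8]) -> usize {
    rep_levels.iter().filter(|&&l| l == 0).count()
}

/// Skip up to `to_skip` rows, decompressing a page only when unavoidable.
fn skip_records(pages: &[Page], mut to_skip: usize) -> SkipStats {
    let mut stats = SkipStats::default();
    for page in pages {
        if to_skip == 0 {
            break;
        }
        match page.num_rows {
            // Row count known and the whole page falls inside the skip:
            // drop the page without touching its contents.
            Some(n) if n <= to_skip => {
                to_skip -= n;
                stats.rows_skipped += n;
            }
            // Row count unknown, or the skip ends inside this page:
            // the levels have to be decoded, which means decompressing.
            _ => {
                stats.pages_decompressed += 1;
                let n = count_records(&page.rep_levels).min(to_skip);
                to_skip -= n;
                stats.rows_skipped += n;
            }
        }
    }
    stats
}
```

   In this model, skipping 5 rows across three 2-row pages decompresses all 
three pages when `num_rows` is `None`, but only the last one (where the skip 
ends mid-page) when `num_rows` is known.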
   
   I don't think pages are uncompressed twice...it's just a result of the two 
paths through `ParquetRecordBatchReader::next` (call `skip_records` until 
enough have been skipped, then switch over to `read_records`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
