etseidl commented on issue #6454: URL: https://github.com/apache/arrow-rs/issues/6454#issuecomment-2374680541
> Have you enabled the page index?

Indeed. Or enabled V2 page headers?

The issue seems to be that when skipping rows (`skip_records` defines a record as rep_level == 0, so a row), the number of rows per page is not known in advance, so to figure out the number of levels to skip, the repetition levels need to be decoded for every page. For V1 pages, unfortunately, the level information is compressed along with the page data, so the entire page needs decompressing to calculate the number of rows.

If either V2 page headers or the page index were enabled, then the number of rows per page would be known without having to do decompression, so entire pages could be skipped with very little effort (the `continue` at L330 above).

I don't think pages are decompressed twice... it's just a result of the two paths through `ParquetRecordBatchReader::next` (call `skip_records` until enough have been skipped, then switch over to `read_records`).
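To make the difference concrete, here is a rough, self-contained sketch of the skipping logic described in this comment. It is a toy model, not the parquet crate's code: `Page`, `count_rows`, `example_pages` and this `skip_records` are hypothetical names made up for illustration, and the model assumes every page starts on a row boundary.

```rust
/// Hypothetical stand-in for a data page (not a parquet crate type).
struct Page {
    /// Row count taken from a V2 page header or the page index, if available.
    num_rows: Option<usize>,
    /// Repetition levels. For a V1 page these can only be read after
    /// decompressing the whole page.
    rep_levels: Vec<u8>,
}

/// A new row starts wherever the repetition level is 0.
fn count_rows(rep_levels: &[u8]) -> usize {
    rep_levels.iter().filter(|&&l| l == 0).count()
}

/// Skips up to `to_skip` rows and returns (rows_skipped, pages_decompressed).
fn skip_records(pages: &[Page], to_skip: usize) -> (usize, usize) {
    let mut skipped = 0;
    let mut decompressed = 0;
    for page in pages {
        if skipped == to_skip {
            break;
        }
        let remaining = to_skip - skipped;
        if let Some(n) = page.num_rows {
            if n <= remaining {
                // Row count known up front: drop the page without touching its data.
                skipped += n;
                continue;
            }
        }
        // Row count unknown, or only part of the page is skipped:
        // the page has to be decompressed so the levels can be decoded.
        decompressed += 1;
        let rows = count_rows(&page.rep_levels);
        skipped += rows.min(remaining);
    }
    (skipped, decompressed)
}

/// Three small pages holding 2, 3 and 1 rows. `known` controls whether the
/// row counts are available without decompression.
fn example_pages(known: bool) -> Vec<Page> {
    let levels: Vec<Vec<u8>> = vec![vec![0, 1, 0, 1, 1], vec![0, 0, 0], vec![0, 1]];
    levels
        .into_iter()
        .map(|rep_levels| Page {
            num_rows: if known { Some(count_rows(&rep_levels)) } else { None },
            rep_levels,
        })
        .collect()
}
```

In this model, `skip_records(&example_pages(false), 5)` has to decompress two pages to skip five rows, while `skip_records(&example_pages(true), 5)` skips the same five rows without decompressing anything.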
