alamb commented on issue #8441: URL: https://github.com/apache/arrow-rs/issues/8441#issuecomment-3355766516
> Thank you [@alamb](https://github.com/alamb) and [@etseidl](https://github.com/etseidl). These are great results! If it's not too hard, can you include results from skipping all index structures _except_ statistics? That would bring it closer to a comparison with the flatbuf proposal, which includes statistics.

Most of the reported improvement comes from not reading the [PageIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md) (OffsetIndex and ColumnIndex), which is possible to do with arrow 56 as well.

If by statistics you mean the [statistics on ColumnChunks](https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912-L939), then using the new decoder and skipping the PageIndex is between 4.5 and 5x faster than reading both the metadata and the PageIndex.

For example, at 100k string columns, it is about 4.7x faster:
* arrow 56: 3639ms (1314ms for footer and 2325ms for PageIndex)
* arrow 57: 776ms (footer only)

Full data can be [found here in this spreadsheet](https://docs.google.com/spreadsheets/d/1Ypsox5EywNmv9ORwrlmJlWcPVvWlOW_QCnIt_U68vbo/edit?gid=1818026620#gid=1818026620)

<img width="1699" height="272" alt="Image" src="https://github.com/user-attachments/assets/9f0cc10b-e109-4a54-ae79-b3371fd60ef2" />

| Speedup |
|--------|
| 4.950819672 |
| 5.166666667 |
| 5.28358209 |
| 4.915068493 |
| 4.959349593 |
| 5.166666667 |
| 4.901408451 |
| 4.68943299 |
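As a quick sanity check on the 100k-string-column figures quoted above, the arithmetic can be sketched in a few lines of Python. The constant and function names are made up for illustration, and the only inputs are the three timings from the comment. The footer-to-footer ratio assumes the arrow 56 and arrow 57 footer timings are directly comparable, which the comment does not state explicitly.

```python
# Timings quoted in the comment for 100k string columns, in milliseconds.
ARROW_56_FOOTER_MS = 1314
ARROW_56_PAGE_INDEX_MS = 2325
ARROW_57_FOOTER_MS = 776


def speedup(baseline_ms: float, new_ms: float) -> float:
    """Return how many times faster `new_ms` is than `baseline_ms`."""
    return baseline_ms / new_ms


# arrow 56 reads both the footer and the PageIndex.
arrow_56_total_ms = ARROW_56_FOOTER_MS + ARROW_56_PAGE_INDEX_MS

# Overall speedup: arrow 56 (footer + PageIndex) vs arrow 57 (footer only).
overall = speedup(arrow_56_total_ms, ARROW_57_FOOTER_MS)

# Footer-only speedup (assumes the two footer timings are comparable).
footer_only = speedup(ARROW_56_FOOTER_MS, ARROW_57_FOOTER_MS)

# Fraction of the arrow 56 time that went to the PageIndex.
page_index_share = ARROW_56_PAGE_INDEX_MS / arrow_56_total_ms

print(f"arrow 56 total:   {arrow_56_total_ms} ms")
print(f"overall speedup:  {overall:.2f}x")
print(f"footer speedup:   {footer_only:.2f}x")
print(f"PageIndex share:  {page_index_share:.0%}")
```

The overall ratio matches the last row of the speedup table (roughly 4.69), and the PageIndex share is consistent with the remark that most of the improvement comes from skipping the PageIndex.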
