alamb commented on issue #8441:
URL: https://github.com/apache/arrow-rs/issues/8441#issuecomment-3355766516

   > Thank you [@alamb](https://github.com/alamb) and 
[@etseidl](https://github.com/etseidl). These are great results! If not too 
hard can you include results from skipping all index structures _except_ 
statistics? That would bring it closer to comparing with the flatbuf proposal 
which includes statistics.
   
   Most of the reported improvement comes from not reading the 
[PageIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md) 
(OffsetIndex and ColumnIndex), which is also possible with arrow 56. 
   
   If by statistics you mean the [statistics on 
ColumnChunks](https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912-L939)
 then using the new decoder and skipping the PageIndex is roughly 4.7x to 5.3x 
faster than reading both the metadata and the PageIndex (see the speedup table 
below).
   
   For example, at 100k string columns, it is about 4.7x faster:
   * arrow 56: 3639ms (1314ms for footer and 2325ms for PageIndex)
   * arrow 57: 776ms (just 776ms for the footer)
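
The arithmetic behind the 100k-column figures can be checked with a short sketch. It uses only the timings quoted above; the variable names are illustrative, and the footer-only ratio is derived from those same numbers rather than reported separately in the comment.

```python
# Timings quoted above, in milliseconds, for 100k string columns.
arrow56_footer_ms = 1314
arrow56_page_index_ms = 2325
arrow57_footer_ms = 776

# arrow 56 total when both the footer and the PageIndex are read.
arrow56_total_ms = arrow56_footer_ms + arrow56_page_index_ms

# Speedup of arrow 57 (footer only) over arrow 56 (footer + PageIndex).
speedup_vs_full = arrow56_total_ms / arrow57_footer_ms

# Derived comparison: footer decoding alone in both versions.
speedup_footer_only = arrow56_footer_ms / arrow57_footer_ms

print(f"arrow 56 total: {arrow56_total_ms} ms")
print(f"speedup vs footer + PageIndex: {speedup_vs_full:.2f}x")
print(f"speedup vs footer only: {speedup_footer_only:.2f}x")
```

The first ratio matches the last row of the speedup table; the second isolates how much of the gain would remain if arrow 56 also skipped the PageIndex.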
   
   Full data can be [found here in this 
spreadsheet](https://docs.google.com/spreadsheets/d/1Ypsox5EywNmv9ORwrlmJlWcPVvWlOW_QCnIt_U68vbo/edit?gid=1818026620#gid=1818026620)
   
   <img width="1699" height="272" alt="Image" 
src="https://github.com/user-attachments/assets/9f0cc10b-e109-4a54-ae79-b3371fd60ef2" />
   
   | Speedup |
   |--------|
   | 4.950819672 |
   | 5.166666667 |
   | 5.28358209 |
   | 4.915068493 |
   | 4.959349593 |
   | 5.166666667 |
   | 4.901408451 | 
   | 4.68943299 |
   
   

