Re: [PR] Use custom thrift parser for parquet metadata (phase 1 of Thrift remodel) [arrow-rs]

via GitHub Tue, 11 Nov 2025 11:19:21 -0800


alamb commented on PR #8530:
URL: https://github.com/apache/arrow-rs/pull/8530#issuecomment-3518415320


   Reposting from a [discord thread](https://discord.com/channels/885562378132000778/1314936346653102080/1437878033213034526)
from @corasaurus-hex to get it a bit more out there (a happy customer of the
new parser, @etseidl)
   
   > I have a bit of a wild problem I'm working on for work, and have been on 
and off over the past month. I'm using an extreme example of the problem to 
prove out performance before we dig in further -- essentially if I can't get 
the startup performance (the point at which it starts emitting records) fast 
enough for the more extreme cases then it's not worth going further
   > 
   > With datafusion 50.0.0 the startup performance of this query was ~950ms 
for uncompressed parquet, and ~850ms for compressed parquet.
   > With datafusion 51.0.0 the startup performance of this query is 450ms for 
uncompressed, and consistently 270ms for compressed
   >
   > I think the improvements mainly come from arrow 57's parquet parser -- 
this query is combining 5,153 parquet files
   > 
   > datafusion 51 has essentially made this project workable, and this project 
is my big swing for the year that I've been fighting to get approved and will 
simplify tons of infrastructure
   > 
   > so thanks for all the effort everyone!


