alamb commented on PR #8530:
URL: https://github.com/apache/arrow-rs/pull/8530#issuecomment-3518415320

Reposting from a [discord thread](https://discord.com/channels/885562378132000778/1314936346653102080/1437878033213034526) from @corasaurus-hex to get it a bit more out there (a happy customer of the new parser @etseidl)

> I have a bit of a wild problem I'm working on for work, and have been on and off over the past month. I'm using an extreme example of the problem to prove out performance before we dig in further -- essentially if I can't get the startup performance (the point at which it starts emitting records) fast enough for the more extreme cases then it's not worth going further
>
> With datafusion 50.0.0 the startup performance of this query was ~950ms for uncompressed parquet, and ~850ms for compressed parquet.
> With datafusion 51.0.0 the startup performance of this query is 450ms for uncompressed, and consistently 270ms for compressed
>
> I think the improvements mainly come from arrow 57's parquet parser -- this query is combining 5,153 parquet files
>
> datafusion 51 has essentially made this project workable, and this project is my big swing for the year that I've been fighting to get approved and will simplify tons of infrastructure
>
> so thanks for all the effort everyone!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
