alamb commented on issue #11442: URL: https://github.com/apache/datafusion/issues/11442#issuecomment-2228160735
> Do you have a list in mind the area that is worth for performance improvement? Somethings I known that are still active in my head

In my mind, the "obvious" performance projects (the ones I am most confident would make a meaningful difference on ClickBench or TPCH queries) are as follows (I can maybe put this in the documentation):

## Integrate StringView into Parquet / Filtering / Grouping

* #10918

@XiangpengHao is doing this as his summer project and doing an amazing job. I also think this is a great example of the level of effort required to drive one of these performance projects. It requires implementing the features, then analyzing / profiling, identifying the bottlenecks, and then making PRs to remove the bottlenecks.

See #10918 and https://github.com/apache/arrow-rs/issues/5374 for the entire list. Some of my favorites:

* https://github.com/apache/arrow-rs/pull/6009
* https://github.com/apache/arrow-rs/issues/6034

**What**: Use the newly added `StringView` from arrow to improve performance (by avoiding variable length/string data copies)

**Why**: For queries that deal with string data in ClickBench or TPCH, a large amount of time is spent in parquet decoding as well as filtering and grouping.

**What is left**: See #10918 and https://github.com/apache/arrow-rs/issues/5374
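To illustrate the "avoiding variable length/string data copies" point, here is a minimal std-only sketch of the idea behind a view-based string layout. It is not the arrow-rs API: `View`, `ViewColumn`, `from_strs`, `get`, and `filter` are hypothetical names invented for this example. It assumes a simplified version of the Arrow view layout, where strings of at most 12 bytes are stored inline in a fixed-size view and longer strings are referenced by offset into a shared buffer (the real layout also stores a 4-byte prefix and a buffer index, which are omitted here).

```rust
use std::sync::Arc;

/// Hypothetical, simplified model of one fixed-size string view.
#[derive(Clone, Copy)]
enum View {
    /// Strings of at most 12 bytes live directly inside the view.
    Inline { len: u8, data: [u8; 12] },
    /// Longer strings are referenced by offset and length in a shared buffer.
    Ref { len: u32, offset: u32 },
}

/// Hypothetical column: fixed-size views plus one shared data buffer.
struct ViewColumn {
    views: Vec<View>,
    buffer: Arc<Vec<u8>>,
}

impl ViewColumn {
    /// Build a column, writing only the long strings into the data buffer.
    fn from_strs(values: &[&str]) -> Self {
        let mut views = Vec::with_capacity(values.len());
        let mut buffer = Vec::new();
        for s in values {
            let bytes = s.as_bytes();
            if bytes.len() <= 12 {
                let mut data = [0u8; 12];
                data[..bytes.len()].copy_from_slice(bytes);
                views.push(View::Inline { len: bytes.len() as u8, data });
            } else {
                views.push(View::Ref {
                    len: bytes.len() as u32,
                    offset: buffer.len() as u32,
                });
                buffer.extend_from_slice(bytes);
            }
        }
        ViewColumn { views, buffer: Arc::new(buffer) }
    }

    /// Resolve the view at index `i` back to a string slice.
    fn get(&self, i: usize) -> &str {
        let bytes = match &self.views[i] {
            View::Inline { len, data } => &data[..*len as usize],
            View::Ref { len, offset } => {
                let start = *offset as usize;
                &self.buffer[start..start + *len as usize]
            }
        };
        std::str::from_utf8(bytes).expect("values were built from valid UTF-8")
    }

    /// Filter by copying only the fixed-size views; the string bytes in
    /// `buffer` are shared with the input rather than copied.
    fn filter(&self, keep: &[bool]) -> Self {
        let views = self
            .views
            .iter()
            .zip(keep)
            .filter(|(_, k)| **k)
            .map(|(v, _)| *v)
            .collect();
        ViewColumn { views, buffer: Arc::clone(&self.buffer) }
    }
}

fn main() {
    let col = ViewColumn::from_strs(&["GET", "a considerably longer string value", "POST"]);
    let filtered = col.filter(&[false, true, true]);
    // The filtered column points at the same data buffer as the input.
    assert!(Arc::ptr_eq(&col.buffer, &filtered.buffer));
    for i in 0..filtered.views.len() {
        println!("{}", filtered.get(i));
    }
}
```

With an offsets-based string array, the same filter would have to copy the selected string bytes into a new contiguous buffer. In this sketch the filter touches only the small views, which is the kind of saving the StringView work is aiming for in filtering and grouping.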