alamb commented on issue #9547: URL: https://github.com/apache/arrow-datafusion/issues/9547#issuecomment-1988179761
Hi @Lordworms -- thank you for this analysis. > (seems like we did not really do parallism and I really think that's some problem comes from Tokio) I do not agree with this statement in general (though it may be that TPCH parallelism could be improved), -- DataFusion uses a signfiicant amount of CPU / parallelism and while tokio results in more complicated stack traces for sure, I think overall the benfits are worth it. We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper (https://github.com/apache/arrow-datafusion/issues/6782) [DataFusion_Query_Engine___SIGMOD_2024.pdf](https://github.com/apache/arrow-datafusion/files/13874720/DataFusion_Query_Engine___SIGMOD_2024.pdf) where we compared single core efficiency and scaling (see the results section). We found areas that each engine did better in. If your goal is to improve the performance of DataFusion in the TPCH queries I have some thoughts: 1. The TPCH benchmark has many large joins. Thus the efficiency of the both the join plans and the join operators (e.g. `HashJoinExec`) is important for good TPCH 2. The level of optimization that has been invested into DataFusion joins is relatively low compared to aggregationing and filtering (see https://github.com/apache/arrow-datafusion/issues/8398 for a list of potential ideas) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
