Lordworms commented on issue #9547: URL: https://github.com/apache/arrow-datafusion/issues/9547#issuecomment-1988955945
> I run DF on a c7i.48xlarge instance type in aws (192 cores, 384GB RAM) and during my processing I'm seeing almost 100% cpu usage across the board. So parallelism in my usecase is essentially perfect - though I can't speak for the efficiency. > >  Yes, I runed > Hi @Lordworms -- thank you for this analysis. > > > (seems like we did not really do parallism and I really think that's some problem comes from Tokio) > > I do not agree with this statement in general (though it may be that TPCH parallelism could be improved), -- DataFusion uses a signfiicant amount of CPU / parallelism and while tokio results in more complicated stack traces for sure, I think overall the benfits are worth it. > > We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper (#6782) [DataFusion_Query_Engine___SIGMOD_2024.pdf](https://github.com/apache/arrow-datafusion/files/13874720/DataFusion_Query_Engine___SIGMOD_2024.pdf) where we compared single core efficiency and scaling (see the results section). We found areas that each engine did better in. > > If your goal is to improve the performance of DataFusion in the TPCH queries I have some thoughts: > > 1. The TPCH benchmark has many large joins. Thus the efficiency of the both the join plans and the join operators (e.g. `HashJoinExec`) is important for good TPCH > 2. The level of optimization that has been invested into DataFusion joins is relatively low compared to aggregationing and filtering (see [[Epic] A collection of Join Improvements #8398](https://github.com/apache/arrow-datafusion/issues/8398) for a list of potential ideas) > Hi @Lordworms -- thank you for this analysis. > > > (seems like we did not really do parallism and I really think that's some problem comes from Tokio) > > I do not agree with this statement in general (though it may be that TPCH parallelism could be improved), -- DataFusion uses a signfiicant amount of CPU / parallelism and while tokio results in more complicated stack traces for sure, I think overall the benfits are worth it. > > We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper (#6782) [DataFusion_Query_Engine___SIGMOD_2024.pdf](https://github.com/apache/arrow-datafusion/files/13874720/DataFusion_Query_Engine___SIGMOD_2024.pdf)EditSign where we compared single core efficiency and scaling (see the results section). We found areas that each engine did better in. > > If your goal is to improve the performance of DataFusion in the TPCH queries I have some thoughts: > > 1. The TPCH benchmark has many large joins. Thus the efficiency of the both the join plans and the join operators (e.g. `HashJoinExec`) is important for good TPCH > 2. The level of optimization that has been invested into DataFusion joins is relatively low compared to aggregationing and filtering (see [[Epic] A collection of Join Improvements #8398](https://github.com/apache/arrow-datafusion/issues/8398) for a list of potential ideas) Got it , I'll check those issues -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
