alamb commented on issue #9547:
URL: 
https://github.com/apache/arrow-datafusion/issues/9547#issuecomment-1988179761

   Hi @Lordworms  -- thank you for this analysis. 
   
   > (seems like we did not really do parallism and I really think that's some 
problem comes from Tokio)
   
   I do not agree with this statement in general (though it may be that TPCH 
parallelism could be improved), -- DataFusion uses a signfiicant amount of CPU 
/ parallelism and while tokio results in more complicated stack traces for 
sure, I think overall the benfits are worth it. 
   
   We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper 
(https://github.com/apache/arrow-datafusion/issues/6782) 
[DataFusion_Query_Engine___SIGMOD_2024.pdf](https://github.com/apache/arrow-datafusion/files/13874720/DataFusion_Query_Engine___SIGMOD_2024.pdf)
 where we compared single core efficiency and scaling (see the results 
section). We found areas that each engine did better in.
   
   If your goal is to improve the performance of DataFusion in the TPCH queries 
I have some thoughts:
   1. The TPCH benchmark has many large joins. Thus the efficiency of the both 
the join plans and the join operators (e.g. `HashJoinExec`) is important for 
good TPCH
   2. The level of optimization that has been invested into DataFusion joins is 
relatively low compared to aggregationing and filtering (see 
https://github.com/apache/arrow-datafusion/issues/8398 for a list of potential 
ideas)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to