Re: [I] Efficiency Problem: Parallelization and vectorization [arrow-datafusion]

via GitHub Mon, 11 Mar 2024 04:05:27 -0700


alamb commented on issue #9547:
URL: 
https://github.com/apache/arrow-datafusion/issues/9547#issuecomment-1988179761

Hi @Lordworms -- thank you for this analysis.

> (seems like we did not really do parallism and I really think that's some
problem comes from Tokio)

I do not agree with this statement in general (though it may be that TPCH
parallelism could be improved), -- DataFusion uses a signfiicant amount of CPU
/ parallelism and while tokio results in more complicated stack traces for
sure, I think overall the benfits are worth it.

We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper
(https://github.com/apache/arrow-datafusion/issues/6782)
[DataFusion_Query_Engine___SIGMOD_2024.pdf](https://github.com/apache/arrow-datafusion/files/13874720/DataFusion_Query_Engine___SIGMOD_2024.pdf)
where we compared single core efficiency and scaling (see the results
section). We found areas that each engine did better in.

If your goal is to improve the performance of DataFusion in the TPCH queries
I have some thoughts:
1. The TPCH benchmark has many large joins. Thus the efficiency of the both
the join plans and the join operators (e.g. `HashJoinExec`) is important for
good TPCH
2. The level of optimization that has been invested into DataFusion joins is
relatively low compared to aggregationing and filtering (see
https://github.com/apache/arrow-datafusion/issues/8398 for a list of potential
ideas)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Efficiency Problem: Parallelization and vectorization [arrow-datafusion]

Reply via email to