Re: [I] Efficiency Problem: Parallelization and vectorization [arrow-datafusion]

via GitHub Mon, 11 Mar 2024 09:59:19 -0700


Lordworms commented on issue #9547:
URL: 
https://github.com/apache/arrow-datafusion/issues/9547#issuecomment-1988955945


   > I run DF on a c7i.48xlarge instance type in AWS (192 cores, 384GB RAM) and 
during my processing I'm seeing almost 100% CPU usage across the board. So 
parallelism in my use case is essentially perfect - though I can't speak for the 
efficiency.
   > 
   > 
![image](https://private-user-images.githubusercontent.com/226258/311723497-ba28daa5-f6a9-4f9c-9373-81063e01fac9.png)
   
   Yes, I ran
   
   > Hi @Lordworms -- thank you for this analysis.
   > 
   > > (seems like we did not really do parallelism and I really think that's 
some problem that comes from Tokio)
   > 
   > I do not agree with this statement in general (though it may be that TPCH 
parallelism could be improved). DataFusion uses a significant amount of CPU 
/ parallelism, and while tokio results in more complicated stack traces for 
sure, I think overall the benefits are worth it.
   > 
   > We did a comparison of DataFusion and DuckDB in our upcoming SIGMOD paper 
(#6782) 
[DataFusion_Query_Engine___SIGMOD_2024.pdf](https://github.com/apache/arrow-datafusion/files/13874720/DataFusion_Query_Engine___SIGMOD_2024.pdf)
 where we compared single core efficiency and scaling (see the results 
section). We found areas that each engine did better in.
   > 
   > If your goal is to improve the performance of DataFusion in the TPCH 
queries, I have some thoughts:
   > 
   > 1. The TPCH benchmark has many large joins. Thus the efficiency of 
both the join plans and the join operators (e.g. `HashJoinExec`) is important 
for good TPCH performance
   > 2. The level of optimization that has been invested into DataFusion joins 
is relatively low compared to aggregation and filtering (see [[Epic] A 
collection of Join Improvements 
#8398](https://github.com/apache/arrow-datafusion/issues/8398) for a list of 
potential ideas)
   
   Got it , I'll check those issues

