Re: [I] TPC-DS query 72 slow on modest (10) scale factors [datafusion]

via GitHub Thu, 25 Sep 2025 13:10:19 -0700


milenkovicm commented on issue #17494:
URL: https://github.com/apache/datafusion/issues/17494#issuecomment-3335751755


   I had a look at this query yesterday (same scale factor, 10). Spark 4.0 takes 
15 seconds, compared to DataFusion, which takes 3 minutes 40 seconds. 
   
   Comparing the two plans, there are two notable differences: Spark uses 
`BroadcastHashJoin` for all joins apart from the innermost one, for which it 
uses `SortMergeJoin`. If Spark is forced to use `ShuffledHashJoin` instead of 
`SortMergeJoin`, the overall time increases from 15 seconds to about 1 minute 
30 seconds, which is still faster than DataFusion. 
   
   Lastly, if `BroadcastHashJoin` is disabled, performance degrades further to 
around 2 minutes 55 seconds, which is still faster than DataFusion. 
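
   For anyone who wants to repeat the Spark side of this, here is a minimal 
sketch of settings that could steer Spark toward those plans. The comment above 
does not say which settings were actually used, so the choice of 
`spark.sql.join.preferSortMergeJoin` and `spark.sql.autoBroadcastJoinThreshold` 
is an assumption, and `to_set_statements` is a hypothetical helper. Spark may 
still pick a sort-merge join when its own size checks are not met, in which 
case a `SHUFFLE_HASH` join hint is another option.

```python
# Assumed Spark SQL settings; the comment does not state which ones were used.

# Ask the planner to prefer a shuffled hash join over a sort-merge join.
PREFER_SHUFFLED_HASH_JOIN = {
    "spark.sql.join.preferSortMergeJoin": "false",
}

# A threshold of -1 turns off automatic broadcast hash joins.
DISABLE_BROADCAST_HASH_JOIN = {
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}


def to_set_statements(settings):
    """Render a settings dict as Spark SQL `SET key=value` statements."""
    return [f"SET {key}={value}" for key, value in sorted(settings.items())]


if __name__ == "__main__":
    combined = {**PREFER_SHUFFLED_HASH_JOIN, **DISABLE_BROADCAST_HASH_JOIN}
    for statement in to_set_statements(combined):
        print(statement)
```

   The generated statements can be run in a Spark SQL session before the query, 
or the same keys can be passed with `--conf` on the command line.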
   
   DataFusion does repartitioning after each join, and I haven't really tried 
to force `CollectLeft`. 
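
   To illustrate why the join mode could matter here, below is a toy sketch in 
plain Python (not DataFusion code) that contrasts a partitioned hash join, 
where both inputs are hash-repartitioned first, with a `CollectLeft`-style 
join, where the small left side is gathered once and each right partition is 
probed in place. All function names and the "rows moved" cost model are 
assumptions made for illustration only; this is not a measurement of query 72.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows):
    """Join two lists of (key, value) rows using an in-memory hash table."""
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    return [(key, b, p) for key, p in probe_rows for b in table.get(key, [])]


def repartition(partitions, n):
    """Hash-repartition rows by key; every row is counted as moved."""
    out = [[] for _ in range(n)]
    moved = 0
    for part in partitions:
        for row in part:
            out[hash(row[0]) % n].append(row)
            moved += 1
    return out, moved


def partitioned_join(left_parts, right_parts, n):
    """Partitioned mode: repartition both inputs, then join partition-wise."""
    left, moved_left = repartition(left_parts, n)
    right, moved_right = repartition(right_parts, n)
    rows = [r for lp, rp in zip(left, right) for r in hash_join(lp, rp)]
    return rows, moved_left + moved_right


def collect_left_join(left_parts, right_parts):
    """CollectLeft-style mode: gather the left side once, probe in place."""
    left = [row for part in left_parts for row in part]
    rows = [r for part in right_parts for r in hash_join(left, part)]
    return rows, len(left)  # only the build side is counted as moved


if __name__ == "__main__":
    small = [[(1, "a"), (2, "b")], [(3, "c")]]  # small build side
    large = [[(k % 5, k) for k in range(p, 1000, 2)] for p in range(2)]
    rows_p, moved_p = partitioned_join(small, large, 4)
    rows_c, moved_c = collect_left_join(small, large)
    print(sorted(rows_p) == sorted(rows_c), moved_p, moved_c)
```

   Both modes return the same rows; in this toy model the partitioned mode 
moves every row of both inputs, while the `CollectLeft`-style mode moves only 
the small side.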
   
   Overall, the join ordering looks the same in Spark and DataFusion; the 
difference is in the physical join operators selected.
   
   Ballista is slightly faster than DataFusion, at 3 minutes 20 seconds, due to 
join re-ordering within stages. Unfortunately, `CollectLeft` is broken in 
Ballista, as @Dandandan pointed out in 
https://github.com/apache/datafusion-ballista/issues/1055 and never PRed 😃
   
   Hope it helps 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

