milenkovicm commented on issue #17494: URL: https://github.com/apache/datafusion/issues/17494#issuecomment-3335751755
I had a look at this query yesterday (same scale, 10). Spark 4.0 takes 15 seconds, compared to DataFusion, which takes 3 minutes 40 seconds.

Comparing the two plans, the notable difference is in the physical joins: Spark uses `BroadcastHashJoin` for all joins apart from the innermost one, for which it uses `SortMergeJoin`. If Spark is forced to use `ShuffledHashJoin` instead of `SortMergeJoin`, the overall time increases from 15 seconds to about 1 minute 30 seconds, which is still faster than DataFusion. Lastly, if `BroadcastHashJoin` is disabled, performance degrades further to around 2 minutes 55 seconds, which is still faster than DataFusion.

DataFusion repartitions after each join, and I haven't really tried to force `CollectLeft`. Overall, the join ordering looks the same between Spark and DataFusion; only the selected physical joins differ.

Ballista is slightly faster than DataFusion, at 3 minutes 20 seconds, due to join re-ordering within stages. Unfortunately `CollectLeft` is broken in Ballista, as @Dandandan pointed out in https://github.com/apache/datafusion-ballista/issues/1055 (and never PRed 😃).

Hope it helps
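For readers less familiar with the two join shapes compared above, here is a toy Python sketch of the difference. It is only an illustration under my own assumptions, not DataFusion or Spark code: the function names (`collect_left_join`, `partitioned_join`) and the row layout are made up. The broadcast / `CollectLeft` style builds one hash table from the small side and reuses it for every probe partition, while the partitioned style first hash-repartitions both sides on the join key.

```python
def build_hash_table(rows, key):
    """Group build-side rows by join key."""
    table = {}
    for row in rows:
        table.setdefault(row[key], []).append(row)
    return table


def probe(table, rows, key):
    """Inner join: emit one merged row per matching (build, probe) pair."""
    return [{**b, **p} for p in rows for b in table.get(p[key], [])]


def collect_left_join(build_rows, probe_partitions, key):
    # Broadcast / CollectLeft style (toy): the build side is collected once
    # and the same hash table is probed by every existing probe partition.
    # The probe side is not shuffled.
    table = build_hash_table(build_rows, key)
    return [probe(table, part, key) for part in probe_partitions]


def partitioned_join(build_rows, probe_rows, key, n):
    # Partitioned style (toy): both sides are hash-repartitioned on the key,
    # then each pair of partitions is joined independently.
    def repartition(rows):
        parts = [[] for _ in range(n)]
        for row in rows:
            parts[hash(row[key]) % n].append(row)
        return parts

    build_parts = repartition(build_rows)
    probe_parts = repartition(probe_rows)
    return [
        probe(build_hash_table(b, key), p, key)
        for b, p in zip(build_parts, probe_parts)
    ]


# Hypothetical data: a small dimension table and a larger fact table.
dims = [{"k": 1, "name": "a"}, {"k": 2, "name": "b"}]
facts = [{"k": 1, "v": 10}, {"k": 2, "v": 20}, {"k": 1, "v": 30}, {"k": 3, "v": 40}]

broadcast_out = collect_left_join(dims, [facts[:2], facts[2:]], "k")
partitioned_out = partitioned_join(dims, facts, "k", n=2)
```

Both functions return the same joined rows; what differs is how much data has to move before the join can start, which is presumably where the repartition-after-each-join plan loses time when the build side is small.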
