alamb commented on issue #9846: URL: https://github.com/apache/arrow-datafusion/issues/9846#issuecomment-2034578501
> Is there a rule of thumb for choosing SMJ over HJ?

I believe the current state of the art in query processing is:

1. If the data is already sorted by the join keys, use MergeJoin (as @Dandandan says)
2. If the data is not already sorted on the join keys, use HashJoin
3. If HashJoin runs out of memory building the hash table, spill the table to disk (possibly switching to merge join internally)

The only benefit SMJ has over HJ at the moment in DataFusion is that we could plausibly join relations that are larger than memory using SMJ (using the fact that we can spill the inputs) -- this may be what @Dandandan is saying in https://github.com/apache/arrow-datafusion/issues/9846#issuecomment-2034369728

I think it is close to impossible to make MJ beat HJ for performance

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
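The three-step rule of thumb in the comment above can be written down as a small decision function. This is only a sketch of the heuristic as stated, not DataFusion's planner logic: `JoinInput`, `JoinStrategy`, `choose_join`, and the `memory_budget_bytes` parameter are hypothetical names introduced for illustration, and the choice of the smaller input as the hash build side is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class JoinStrategy(Enum):
    SORT_MERGE = "sort-merge join"
    HASH = "hash join"
    HASH_WITH_SPILL = "hash join, spilling the hash table to disk"


@dataclass
class JoinInput:
    # Hypothetical planner statistics; these are not DataFusion types.
    sorted_on_join_keys: bool
    estimated_bytes: int


def choose_join(left: JoinInput, right: JoinInput, memory_budget_bytes: int) -> JoinStrategy:
    """Apply the rule of thumb from the comment, in order."""
    # 1. Both inputs already sorted on the join keys: merge join needs no extra sort.
    if left.sorted_on_join_keys and right.sorted_on_join_keys:
        return JoinStrategy.SORT_MERGE

    # 2. Otherwise prefer a hash join. Assumption: the smaller input is the build side.
    build_bytes = min(left.estimated_bytes, right.estimated_bytes)
    if build_bytes <= memory_budget_bytes:
        return JoinStrategy.HASH

    # 3. The hash table would not fit in memory: spill it to disk
    #    (an engine might switch to a merge join internally at this point).
    return JoinStrategy.HASH_WITH_SPILL
```

For example, `choose_join(JoinInput(False, 10**9), JoinInput(False, 10**6), 10**8)` selects the plain hash join, because the smaller input fits in the assumed memory budget. Step 3 describes the general state of the art; per the comment, the larger-than-memory case in DataFusion is currently what SMJ is for.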
