[GitHub] [arrow-datafusion] Dandandan commented on issue #4139: JoinSelection Rule to choose physical join implementation: HashJoin(Partitioned or CollectLeft) or SortMergeJoin base on Stats

GitBox Tue, 08 Nov 2022 01:16:24 -0800


Dandandan commented on issue #4139:
URL: https://github.com/apache/arrow-datafusion/issues/4139#issuecomment-1306880712


   Sounds like a good plan.
   
   For the hash join, we probably need some benchmarking to figure out good
defaults and avoid performance regressions. `CollectLeft` limits the amount of
parallelism on the left side: building the hash table is relatively expensive
and is done (at least currently) in a single thread. In quite a few cases it
might be more beneficial to do a (local) hash repartitioning, which is
relatively cheap.
   It also depends on the size of the probe/right side: if that is, e.g., >100x
as big as the left side, it might be beneficial to avoid the hash repartitioning
on the right side by switching to `CollectLeft`.
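
   As a rough sketch of the size-ratio heuristic described above (not
DataFusion's actual API): the names `HashJoinMode`, `PROBE_TO_BUILD_RATIO`, and
`choose_hash_join_mode` are hypothetical, the ratio of 100 is only the ">100x"
example from this comment and would need benchmarking, and falling back to the
partitioned mode when row counts are unknown is an assumption.

```rust
/// Hypothetical stand-in for the two hash join strategies discussed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashJoinMode {
    /// Hash-repartition both sides and build one hash table per partition.
    Partitioned,
    /// Collect the whole left (build) side into a single hash table.
    CollectLeft,
}

/// Assumed threshold, taken from the ">100x" example; not a benchmarked default.
const PROBE_TO_BUILD_RATIO: u64 = 100;

/// Pick a mode from estimated row counts (`None` means the statistic is unknown).
fn choose_hash_join_mode(build_rows: Option<u64>, probe_rows: Option<u64>) -> HashJoinMode {
    match (build_rows, probe_rows) {
        // Probe side is far larger than the build side: skip repartitioning
        // the probe side and accept a single-threaded build instead.
        (Some(build), Some(probe))
            if probe > build.saturating_mul(PROBE_TO_BUILD_RATIO) =>
        {
            HashJoinMode::CollectLeft
        }
        // Otherwise (or without statistics) keep the build parallel.
        _ => HashJoinMode::Partitioned,
    }
}
```

   With these assumptions, a 1,000-row build side against a 1,000,000-row probe
side would select `CollectLeft`, while a probe side of exactly 100x the build
side would stay `Partitioned`.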


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
