c21 commented on pull request #32210: URL: https://github.com/apache/spark/pull/32210#issuecomment-826250121
> Apparently falling back to SMJ wastes the partially-built hash map.

Note the fallback only happens when there is no memory available to add a new entry to the hash map. So even without enabling shuffled hash join by default for all queries, from my point of view this is still a reliability improvement right now for people who selectively enable shuffled hash join for certain queries (one way to do that is sketched below): they no longer need to worry about OOM from shuffled hash join. If we think this is not the right direction to go, then the same argument applies to the hash aggregate fallback: we enable hash aggregate by default, and if a query keeps falling back from hash aggregate to sort-based aggregate, that is less efficient than just using sort aggregate for the query in the first place.

> If one partition is a bit larger to build the in-memory hash map, I feel spilling the hash map might be a better choice?

To my knowledge there is no obviously efficient way to do the random lookups of a join against a spilled hash map. How would we minimize random disk reads on the spilled map? Happy to brainstorm more if you have some rough ideas.

> If one partition is much larger to build the in-memory hash map, seems we can use the same technique of skew join handling, to split the partition into multiple smaller ones so that they can fit in memory.

Again, I feel this goes back to the AQE hybrid join discussion (the relevant AQE skew-join settings are sketched below). The same disadvantage applies here: it only works for joins with a shuffle, so we still could not reliably enable shuffled hash join in general.
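For context, a minimal sketch of how one might selectively enable shuffled hash join for a specific query today. The table names here are hypothetical, and the `SHUFFLE_HASH` join hint requires Spark 3.0+:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and table names, for illustration only.
val spark = SparkSession.builder().appName("shj-example").getOrCreate()

// Option A: hint shuffled hash join for one query (Spark 3.0+ join hint),
// building the hash map on t2.
val hinted = spark.sql("""
  SELECT /*+ SHUFFLE_HASH(t2) */ t1.id, t2.value
  FROM t1 JOIN t2 ON t1.id = t2.id
""")

// Option B: bias the planner away from sort merge join session-wide.
// This only allows the planner to pick shuffled hash join when its other
// conditions (e.g. build-side size) are met; it is not a guarantee.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
```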
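And a sketch of the AQE skew-join splitting mentioned above. These configs exist in Spark 3.0+; the values shown are illustrative rather than recommendations:

```scala
// AQE skew-join handling: split skewed partitions into smaller ones at
// shuffle time (which is exactly why it cannot help shuffle-free joins).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// A partition is treated as skewed when it is both this many times larger
// than the median partition size and above the byte threshold below.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```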
