[GitHub] [spark] maropu commented on pull request #29655: [SPARK-32806][SQL] SortMergeJoin with partial hash distribution can be optimized to remove shuffle

GitBox Mon, 14 Sep 2020 18:40:42 -0700


maropu commented on pull request #29655:
URL: https://github.com/apache/spark/pull/29655#issuecomment-692409151



   > we added #19054 in our internal fork and don't see much OOM issues.
   
   Even so, I think removing shuffles in the middle of stages (e.g., in many join 
cases) can, in theory, raise the probability of OOM when the data is skewed. Since 
we can control input distributions to some extent, e.g., by bucketing, 
it might be worth trying the restrictive approach that @imback82 suggested 
above.


