[GitHub] [spark] AngersZhuuuu commented on pull request #29692: [SPARK-32830][SQL] Optimize Skewed BroadcastNestedLoopJoin with AQE

GitBox Fri, 11 Sep 2020 00:52:16 -0700


AngersZhuuuu commented on pull request #29692:
URL: https://github.com/apache/spark/pull/29692#issuecomment-690936960



   > Spark query optimizer should not add the extra shuffle by itself, as it's 
likely to cause perf regression.
   
   With this rule, we can't handle this kind of data skew automatically. With 
strict and reasonable conf values, the cost of the extra shuffle is much less 
than the overhead of the data skew.
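
   As a rough illustration of that trade-off, here is a toy model (not a 
measurement from this PR). It assumes a stage finishes only when its slowest 
task does, so stage time follows the largest partition. The partition sizes, 
the `rows_per_second` rate, and `shuffle_cost_seconds` are all made-up numbers 
chosen for illustration.

```python
def stage_time(partition_rows, rows_per_second=1_000_000):
    """Stage time in seconds: the slowest task (largest partition) dominates."""
    return max(partition_rows) / rows_per_second


def rebalance(partition_rows, num_partitions):
    """Model an extra shuffle that spreads all rows evenly across partitions."""
    total = sum(partition_rows)
    base, extra = divmod(total, num_partitions)
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]


# Hypothetical stream side after a GROUP BY: one hot partition, ten small ones.
skewed = [90_000_000] + [1_000_000] * 10
balanced = rebalance(skewed, len(skewed))

# Assumed fixed cost of the extra shuffle, in seconds (illustrative only).
shuffle_cost_seconds = 20.0

without_shuffle = stage_time(skewed)
with_shuffle = stage_time(balanced) + shuffle_cost_seconds

print(f"no extra shuffle:   {without_shuffle:.1f}s")
print(f"with extra shuffle: {with_shuffle:.1f}s")
```

Under these assumed numbers the extra shuffle pays for itself. With a smaller 
skew or a more expensive shuffle it would not, which is why the thresholds 
need to be strict.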
   
   This matters especially for broadcast join and broadcast nested loop join. 
If the stream side's execution ends with a GROUP BY (there are many such 
business scenarios), the data is always seriously skewed. Getting business 
people to tune each job is difficult.
   What does the community think about this scenario?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
