AngersZhuuuu commented on pull request #29692: URL: https://github.com/apache/spark/pull/29692#issuecomment-690936960
> Spark query optimizer should not add the extra shuffle by itself, as it's likely to cause perf regression.

With this rule, we can't handle this kind of data skew automatically. With a strict and reasonable conf value, the cost of the extra shuffle is much lower than the overhead of the data skew. This is especially true for broadcast join and broadcast nested loop join, when the stream side ends with a group by (there are many such business scenarios) and the data is always seriously skewed. Getting business users to tune each job is difficult. What does the community think about this scenario?
