mcdull-zhang opened a new pull request #34908:
URL: https://github.com/apache/spark/pull/34908
### What changes were proposed in this pull request?
Handle data skew separately for each child of a `Union`.
### Why are the changes needed?
The `OptimizeSkewedJoin` rule takes effect only when the plan contains exactly two
`ShuffleQueryStageExec` nodes.
A `Union` can break this assumption, as in the following plans:
<b>Scenario 1</b>
```
Union
  SMJ
    ShuffleQueryStage
    ShuffleQueryStage
  SMJ
    ShuffleQueryStage
    ShuffleQueryStage
```
<b>Scenario 2</b>
```
Union
  SMJ
    ShuffleQueryStage
    ShuffleQueryStage
  HashAggregate
```
When the data of one or more of the SMJs (sort-merge joins) in the plans above is skewed,
the skew currently cannot be handled.
It would be better to support partial optimization under `Union`.
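
To make the difference concrete, here is a minimal toy model in Python. It is not Spark's implementation: the `Node` class and the helpers `shuffle_stages`, `whole_plan_eligible`, and `eligible_subplans` are hypothetical names invented for this sketch. It also assumes that the `HashAggregate` in scenario 2 sits on top of one shuffle stage of its own, which the plan above does not show.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a physical plan node (not a Spark class)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def shuffle_stages(plan: Node) -> List[Node]:
    """Collect every ShuffleQueryStage found under `plan`."""
    if plan.name == "ShuffleQueryStage":
        return [plan]
    return [s for child in plan.children for s in shuffle_stages(child)]


def whole_plan_eligible(plan: Node) -> bool:
    """Toy version of the current assumption: exactly two shuffle stages."""
    return len(shuffle_stages(plan)) == 2


def eligible_subplans(plan: Node) -> List[Node]:
    """Toy version of the proposal: check each child of a Union on its own."""
    if plan.name == "Union":
        return [s for child in plan.children for s in eligible_subplans(child)]
    return [plan] if whole_plan_eligible(plan) else []


def smj() -> Node:
    """A sort-merge join over two shuffle stages."""
    return Node("SMJ", [Node("ShuffleQueryStage"), Node("ShuffleQueryStage")])


scenario_1 = Node("Union", [smj(), smj()])
# Assumption: the aggregate reads from a single shuffle stage of its own.
scenario_2 = Node("Union", [smj(), Node("HashAggregate", [Node("ShuffleQueryStage")])])

for label, plan in [("scenario 1", scenario_1), ("scenario 2", scenario_2)]:
    print(label,
          "| whole plan eligible:", whole_plan_eligible(plan),
          "| eligible children:", [p.name for p in eligible_subplans(plan)])
```

In this model the whole-plan check rejects both scenarios, because the stage count under the `Union` is not two, while the per-child check still finds every SMJ that has exactly two shuffle stages.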
### Does this PR introduce any user-facing change?
Probably yes: the result partitioning might change.
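
As a rough illustration of why the partitioning can change, the sketch below models skew handling as splitting any oversized partition into several smaller ones. It is a deliberate simplification and not Spark's algorithm: the function names, the `factor` and `threshold` parameters, and the sizes are all illustrative assumptions.

```python
from statistics import median
from typing import List


def skewed_indexes(sizes: List[int], factor: int, threshold: int) -> List[int]:
    """Treat a partition as skewed if it exceeds both factor * median and threshold."""
    med = median(sizes)
    return [i for i, s in enumerate(sizes) if s > factor * med and s > threshold]


def result_partition_count(sizes: List[int], target: int,
                           factor: int = 5, threshold: int = 256) -> int:
    """Count partitions after splitting each skewed one into chunks of about `target`."""
    skewed = set(skewed_indexes(sizes, factor, threshold))
    total = 0
    for i, size in enumerate(sizes):
        # Ceiling division: number of chunks needed to cover `size`.
        total += -(-size // target) if i in skewed else 1
    return total


# Hypothetical partition sizes (in MB) for one join child under a Union.
sizes_mb = [100, 120, 110, 2000]
print(len(sizes_mb), "partitions before,",
      result_partition_count(sizes_mb, target=500), "after")
```

With these made-up sizes only the 2000 MB partition counts as skewed, so the child ends up with more partitions than it started with, and the output of the enclosing `Union` changes accordingly.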
### How was this patch tested?
Added tests.