[GitHub] [spark] mcdull-zhang opened a new pull request #34908: [SPARK-37652][SQL]Support optimize skewed join through union

GitBox Wed, 15 Dec 2021 04:45:49 -0800


mcdull-zhang opened a new pull request #34908:
URL: https://github.com/apache/spark/pull/34908



   ### What changes were proposed in this pull request?
   
   Data skew is now handled separately for each child of a `Union`.
   
   
   ### Why are the changes needed?
   The `OptimizeSkewedJoin` rule takes effect only when the plan has two `ShuffleQueryStageExec` nodes.
   
   A `Union` can break this assumption. For example, consider the following plans:
   
   <b>Scenario 1</b>
   ```
   Union
       SMJ
           ShuffleQueryStage
           ShuffleQueryStage
       SMJ
           ShuffleQueryStage
           ShuffleQueryStage
   ```
   
   <b>Scenario 2</b>
   ```
   Union
       SMJ
           ShuffleQueryStage
           ShuffleQueryStage
       HashAggregate
   ```
   When the data in one or more of the SMJs (sort-merge joins) in the plans above is skewed, the skew currently cannot be handled.
   
   It would be better to support partial optimization through `Union`.
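
The idea can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's implementation: `Node`, `count_shuffle_stages`, `eligible_whole_plan`, and `eligible_subplans` are hypothetical names invented for the illustration, and the "two shuffle stages" check is modelled as a plain count over the tree. The plan above does not show what sits under `HashAggregate`, so the sketch leaves it as a leaf and makes no claim about the whole-plan check for Scenario 2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A toy physical-plan node (hypothetical; not Spark's SparkPlan)."""
    name: str
    children: List["Node"] = field(default_factory=list)


def count_shuffle_stages(plan: Node) -> int:
    """Count ShuffleQueryStage nodes in the whole subtree."""
    own = 1 if plan.name == "ShuffleQueryStage" else 0
    return own + sum(count_shuffle_stages(c) for c in plan.children)


def eligible_whole_plan(plan: Node) -> bool:
    """Model of the described behaviour: the plan must have two shuffle stages."""
    return count_shuffle_stages(plan) == 2


def eligible_subplans(plan: Node) -> List[Node]:
    """Model of the proposal: look through Union and check each child on its own."""
    if plan.name == "Union":
        found: List[Node] = []
        for child in plan.children:
            found.extend(eligible_subplans(child))
        return found
    return [plan] if eligible_whole_plan(plan) else []


def smj() -> Node:
    """Build a sort-merge join over two shuffle stages, as drawn in the plans above."""
    return Node("SMJ", [Node("ShuffleQueryStage"), Node("ShuffleQueryStage")])


scenario_1 = Node("Union", [smj(), smj()])
scenario_2 = Node("Union", [smj(), Node("HashAggregate")])

# Scenario 1 has four shuffle stages in total, so the whole-plan check fails.
print(eligible_whole_plan(scenario_1))      # False
# Checking each Union child separately finds both joins.
print(len(eligible_subplans(scenario_1)))   # 2
# In Scenario 2 only the SMJ child qualifies; HashAggregate is skipped.
print(len(eligible_subplans(scenario_2)))   # 1
```

In this model, a child that does not qualify is simply left alone, which is what "partial optimization" means here: the joins that can be optimized are optimized, and the rest of the `Union` is unchanged.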
   
   ### Does this PR introduce any user-facing change?
   
   Probably yes: the partitioning of the result might change.
   
   ### How was this patch tested?
   
   Added test coverage.

