Liulietong commented on pull request #34602:
URL: https://github.com/apache/spark/pull/34602#issuecomment-970056589


   > `OptimizeSkewedJoin` is supposed to only handle materialized shuffle 
stages, or did I miss something?
   
   I haven't changed that. The problem is that since `OptimizeSkewedJoin` was 
moved from `queryStageOptimizerRules` to `queryStagePreparationRules`, it is 
applied to the whole plan.
   For example:
   ```
   +- ShuffledHashJoin [value2#227L], [value3#233L], Inner
      :  +- Exchange hashpartitioning(value2#227L, 10), ENSURE_REQUIREMENTS, [id=#260]
      :     +- Project [key1#220L, value2#227L]
      :        +- ShuffledHashJoin [key1#220L], [key2#226L], Inner
      :           :  +- ShuffleQueryStage 0
      :              +- ShuffleQueryStage 1
         +- ShuffleQueryStage 2
   ```
   We should apply `OptimizeSkewedJoin` to `Project [key1#220L, value2#227L]` 
rather than to the whole plan rooted at 
`ShuffledHashJoin [value2#227L], [value3#233L], Inner`.
   So we need to find the top plan of the new stage that is about to be 
submitted, which will be `Project [key1#220L, value2#227L]` only if 
`ShuffleQueryStage 0` and `ShuffleQueryStage 1` are materialized.
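
To make the idea concrete, here is a minimal sketch on a toy plan tree. It is not Spark's API: `Node`, `ready`, `new_stage_roots` and `build_plan` are hypothetical names used only for illustration, and the real logic lives in the AQE stage-creation code. The sketch assumes that a new stage can be created for an `Exchange` once every query stage below it is materialized, and that the rule should then run on the child of that `Exchange`.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    # Toy stand-in for a physical plan node (hypothetical, not a Spark class).
    name: str
    children: List["Node"] = field(default_factory=list)
    kind: str = "op"            # "op", "exchange", or "stage"
    materialized: bool = False  # only meaningful when kind == "stage"


def ready(node: Node) -> bool:
    """True if every stage under `node` is materialized and no Exchange
    without a stage remains below it."""
    if node.kind == "stage":
        return node.materialized
    if node.kind == "exchange":
        return False
    return all(ready(c) for c in node.children)


def new_stage_roots(node: Node) -> Iterator[Node]:
    """Yield the top plan of each stage that could be submitted now,
    i.e. the child of an Exchange whose inputs are all materialized."""
    if node.kind == "stage":
        return
    if node.kind == "exchange" and all(ready(c) for c in node.children):
        yield node.children[0]
        return
    for child in node.children:
        yield from new_stage_roots(child)


def build_plan(m0: bool, m1: bool, m2: bool) -> Node:
    # Mirrors the shape of the example plan above.
    inner_join = Node("ShuffledHashJoin(inner)", [
        Node("ShuffleQueryStage 0", kind="stage", materialized=m0),
        Node("ShuffleQueryStage 1", kind="stage", materialized=m1),
    ])
    project = Node("Project", [inner_join])
    exchange = Node("Exchange", [project], kind="exchange")
    stage2 = Node("ShuffleQueryStage 2", kind="stage", materialized=m2)
    return Node("ShuffledHashJoin(outer)", [exchange, stage2])
```

With stages 0 and 1 materialized, `new_stage_roots` yields only the `Project` node, so a skew rule in this toy model would see the inner join and never the outer one. If stage 1 is still running, nothing is yielded.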
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


