ekoifman commented on pull request #31653:
URL: https://github.com/apache/spark/pull/31653#issuecomment-797676092


   @cloud-fan thanks for the feedback.
   I agree about the existing workflow.  This PR does exactly what you say: it 
moves `OptimizeSkewedJoin` from `queryStageOptimizerRules` to query stage 
preparation rules.
   
   `EnsureRequirements` is run inside `OptimizeSkewedJoin` - see 
`OptimizeSkewedJoin.handlePotentialQueryStage()`.  And
   if `conf.adaptiveForceIfShuffle` then the new plan with new shuffles is 
returned otherwise the original plan is returned.
   
   I've split the `queryStagePreparationRules` into 2 parts: 
`queryStagePreparationRules` and `queryStagePreparationRules2`.  The two parts 
run one after another in `reOptimize()` so the combined effect is just like 
adding `OptimizeSkewedJoin` to  `queryStagePreparationRules` list. 
   
   The reason I did this is that after `reOptimize`, the logic checks if new 
shuffles were added and I need to know which ones were added by 
`OptimizeSkewedJoin` and which were added by something else.  This is based on 
suggestion from @maryannxue in 
https://github.com/apache/spark/pull/30829#issuecomment-773444471.  I could've 
instead put `OptimizeSkewedJoin` into `queryStageOptimizerRules` directly but 
I'd have to mark the shuffles added by `OptimizeSkewedJoin` somehow so that the 
cost formula applied after `reOptimize` could treat them differently.  
   
   So in this PR `OptimizeSkewedJoin` is run before `createQueryStages` (and 
thus before `queryStageOptimizerRules`), i.e. during query stage preparation.
   
   Moving `OptimizeSkewedJoin` made `OptimizeSkewedJoin` itself a lot more 
complicated.  Previously it ran on part fo the whole query tree bounded by 
Exchange above and materialized `QueryStageExec`s below, i.e. what is to become 
the new query stage.  But query stage preparation rules run from the top of the 
plan to the leaves, i.e. `QueryStageExec`s, so it may encounter any number of 
Exchange nodes that may become the root of a new QueryStage.
   
   Please let me know if this clarifies it or not.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to