[GitHub] [spark] ekoifman commented on pull request #31653: [SPARK-33832][SQL] v2. move OptimzieSkewedJoin to query stage preparation

GitBox Thu, 18 Mar 2021 09:37:41 -0700


ekoifman commented on pull request #31653:
URL: https://github.com/apache/spark/pull/31653#issuecomment-802094504



   `queryStagePreparationRules` are run in `reOptimize` and `CostEvaluator` 
runs after `reOptimize` so I don't understand why you say the costing won't see 
the shuffles added by `OptimizeSkewedJoin` if it runs as part of stage prep....
   
   `OptimizeSkewedJoin` does run `EnsureRequirements` and decides whether to 
it's OK to add shuffles based on `conf.adaptiveForceIfShuffle`.  It does not 
rely on `CostEvaluator` for this.  The issue is what happens once it 
`OptimizeSkewedJoin` adds the shuffles.  These shuffles are indeed visible to 
`CostEvaluator` since the costing is done before these shuffles can be hidden 
by `createQueryStages`.
   
   In current master, `CostEvaluator` runs over the plan after `reOptimize` and 
rejects it if new shuffles were added.  One possible reason for these new 
shuffles is given by @maryannxue in 
https://github.com/apache/spark/pull/30829#issuecomment-773444471.  These 
shuffles have nothing to do with `OptimzieSkewedJoin`.
   
   Since in this PR `OptimzeSkewedJoin` runs during stage preparation, i.e. in 
`reOptimize`, `CostEvaluator` will see new shuffles added by `OptimizeSkewJoin` 
(and any shuffles added for any other reason) and reject the new plan unless it 
knows to treat new shuffles added by `OptimizeSkewedJoin` differently - this is 
the suggestion from @maryannxue in 
https://github.com/apache/spark/pull/30829#issuecomment-773444471.
   
   It may be much easier to discuss this in real time.  Would you be willing to 
get on a short Zoom call or whatever is convenient?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] ekoifman commented on pull request #31653: [SPARK-33832][SQL] v2. move OptimzieSkewedJoin to query stage preparation

Reply via email to