ekoifman commented on pull request #31653: URL: https://github.com/apache/spark/pull/31653#issuecomment-802094504
`queryStagePreparationRules` are run in `reOptimize` and `CostEvaluator` runs after `reOptimize` so I don't understand why you say the costing won't see the shuffles added by `OptimizeSkewedJoin` if it runs as part of stage prep.... `OptimizeSkewedJoin` does run `EnsureRequirements` and decides whether to it's OK to add shuffles based on `conf.adaptiveForceIfShuffle`. It does not rely on `CostEvaluator` for this. The issue is what happens once it `OptimizeSkewedJoin` adds the shuffles. These shuffles are indeed visible to `CostEvaluator` since the costing is done before these shuffles can be hidden by `createQueryStages`. In current master, `CostEvaluator` runs over the plan after `reOptimize` and rejects it if new shuffles were added. One possible reason for these new shuffles is given by @maryannxue in https://github.com/apache/spark/pull/30829#issuecomment-773444471. These shuffles have nothing to do with `OptimzieSkewedJoin`. Since in this PR `OptimzeSkewedJoin` runs during stage preparation, i.e. in `reOptimize`, `CostEvaluator` will see new shuffles added by `OptimizeSkewJoin` (and any shuffles added for any other reason) and reject the new plan unless it knows to treat new shuffles added by `OptimizeSkewedJoin` differently - this is the suggestion from @maryannxue in https://github.com/apache/spark/pull/30829#issuecomment-773444471. It may be much easier to discuss this in real time. Would you be willing to get on a short Zoom call or whatever is convenient? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
