ekoifman commented on pull request #30829:
URL: https://github.com/apache/spark/pull/30829#issuecomment-767954899
RE 1, I see this comment on `OptimizeSkewedJoin` but I don't see the code
actually doing any coalesce. `optimizeSkewJoin()` uses `ShuffleStage.unapply`
to either take `CoalescedPartitionSpec` from a `CustomShuffleReaderExec` child
node or creates `CoalescedPartitionSpec` if the child is
`ShuffleQueryStageExec`. In the latter case, all the specs created are
`CoalescedPartitionSpec(i, i + 1)`. As far as I can tell, the rest of
`optimizeSkewJoin()` either handles a skewed partition or uses
`CoalescedPartitionSpec` from `ShuffleStageInfo`. Is there some other place
`OptimizeSkewedJoin` performs coalesce?
Also, I'd like to clarify one thing in the in the current PR. Stage
optimization doesn't actually add the shuffle. `OptimizeSkewedJoin` in the
stage optimization runs and returns the modified plan with the new shuffle in
`newQueryStage(e)` but this plan is not used. Once this happens, recursive
call stack of `createQueryStages()` stops and control goes back to the top of
`currentPhysicalPlan` in the main `while (!result.allChildStagesMaterialized)
{` of `getFinalPhysicalPlan`. This is done via `ShuffleAddedException ` which
serves as a "request" to add a shuffle. So with respect to
`createQueryStages()`, the insertion of the shuffle is no different than if the
new shuffle was caused by something in `reOptimize(logicalPlan)`.
So in fact, the new shuffle is added during Query Stage preparation as you
suggest, but it looks different in the code because `OptimizeSkewedJoin`
decides where the new shuffle should be added, but it asks higher logic to add
it. In some sense this can be thought of as search space exploration with
backtracking: query stage optimization (specifically `OptimizeSkewedJoin`)
says I can't do what you want unless you give me a plan with a new shuffle in
it and causes query stage creation to backtrack, add the shuffle and restart
query stage creation/optimization on the new `currentPhysicalPlan` with the new
shuffle.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]