ekoifman commented on pull request #30829:
URL: https://github.com/apache/spark/pull/30829#issuecomment-767954899


   RE 1, I see this comment on `OptimizeSkewedJoin` but I don't see the code 
actually doing any coalesce.  `optimizeSkewJoin()` uses `ShuffleStage.unapply` 
to either take `CoalescedPartitionSpec` from a `CustomShuffleReaderExec` child 
node or creates `CoalescedPartitionSpec` if the child is 
`ShuffleQueryStageExec`.  In the latter case, all the specs created are 
`CoalescedPartitionSpec(i, i + 1)`.  As far as I can tell, the rest of 
`optimizeSkewJoin()` either handles a skewed partition or uses 
`CoalescedPartitionSpec` from `ShuffleStageInfo`.  Is there some other place 
`OptimizeSkewedJoin` performs coalesce?
   
   Also, I'd like to clarify one thing in the in the current PR.  Stage 
optimization doesn't actually add the shuffle.  `OptimizeSkewedJoin` in the 
stage optimization runs and returns the modified plan with the new shuffle in 
`newQueryStage(e)` but this plan is not used.   Once this happens, recursive 
call stack of `createQueryStages()` stops and control goes back to the top of 
`currentPhysicalPlan` in the main `while (!result.allChildStagesMaterialized) 
{` of `getFinalPhysicalPlan`.  This is done via `ShuffleAddedException ` which 
serves as a "request" to add a shuffle.  So with respect to 
`createQueryStages()`, the insertion of the shuffle is no different than if the 
new shuffle was caused by something in `reOptimize(logicalPlan)`.  
   
   So in fact, the new shuffle is added during Query Stage preparation as you 
suggest,  but it looks different in the code because `OptimizeSkewedJoin` 
decides where the new shuffle should be added, but it asks higher logic to add 
it.  In some sense this can be thought of as search space exploration with 
backtracking: query stage optimization  (specifically `OptimizeSkewedJoin`) 
says I can't do what you want unless you give me a plan with a new shuffle in 
it and causes query stage creation to backtrack, add the shuffle and restart 
query stage creation/optimization on the new `currentPhysicalPlan` with the new 
shuffle.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to