wangshengjie123 commented on PR #2373:
URL: https://github.com/apache/celeborn/pull/2373#issuecomment-2508876989

   Hello everyone, sorry for the long delay in updating this PR. Compared to 
the previous version, I have made the following changes:
   - When stage rerun is enabled, the Celeborn skew join shuffle id is now 
recorded in the Spark MapOutputTracker.
   - Added logic in the DAGScheduler to prevent retries if the shuffle that hit 
the fetch failure is a recorded skew shuffle id.
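
A minimal sketch of the two changes listed above, assuming the skew shuffle ids are kept in a plain in-memory set. `SkewShuffleRegistry` and its method names are hypothetical and are not the actual Spark or Celeborn API; the real change lives in the MapOutputTracker and DAGScheduler.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the bookkeeping added to Spark's MapOutputTracker.
final class SkewShuffleRegistry {
    // Shuffle ids that were read by a Celeborn skew join.
    private final Set<Integer> skewShuffleIds = ConcurrentHashMap.newKeySet();

    // Called when a skew join reads the given shuffle (assumed hook).
    void recordSkewShuffle(int shuffleId) {
        skewShuffleIds.add(shuffleId);
    }

    boolean isSkewShuffle(int shuffleId) {
        return skewShuffleIds.contains(shuffleId);
    }

    // Models the DAGScheduler check: a fetch failure may lead to a stage
    // retry only when the failed shuffle is not a recorded skew shuffle.
    boolean canRetryOnFetchFailure(int failedShuffleId) {
        return !isSkewShuffle(failedShuffleId);
    }
}
```

In this sketch, after `recordSkewShuffle(3)`, a fetch failure on shuffle 3 is not retried, while a fetch failure on any unrecorded shuffle still is.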
   
   Following @RexXiong's suggestion, after this PR is merged I will submit a 
new PR to fully support stage rerun. My ideas are as follows:
   - In the DAGScheduler, iterate over the fetch-failed skew shuffle ids and 
their downstream shuffle ids. When the corresponding ShuffleMapStage is 
resubmitted, unregister the map statuses of that shuffle and retry all of its 
tasks.
   - In Celeborn, register a callback function in the LifecycleManager. This 
function should be called from handleGetShuffleId. It will reflectively call 
the DAGScheduler to check whether the recorded skew shuffle ids and their 
downstream shuffle ids include the current shuffle id:
     - If they include the current shuffle id, this shuffle must request a 
new shuffle id to prevent data errors.
     - Otherwise, reuse the previous shuffle id.
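
The decision described in the second idea can be sketched as follows. This is only an illustration under stated assumptions: `ShuffleIdPlanner`, `addDependency`, and the parent-to-child map are hypothetical, and the reflective call into the DAGScheduler is replaced by an in-memory lineage map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the "new shuffle id or reuse" decision.
final class ShuffleIdPlanner {
    // Downstream (child) shuffle ids keyed by upstream (parent) shuffle id.
    private final Map<Integer, List<Integer>> downstream = new HashMap<>();
    // Skew shuffle ids that hit a fetch failure.
    private final Set<Integer> failedSkewShuffleIds = new HashSet<>();

    void addDependency(int parentShuffleId, int childShuffleId) {
        downstream.computeIfAbsent(parentShuffleId, k -> new ArrayList<>())
                  .add(childShuffleId);
    }

    void recordFailedSkewShuffle(int shuffleId) {
        failedSkewShuffleIds.add(shuffleId);
    }

    // Failed skew shuffles plus everything reachable downstream of them.
    Set<Integer> affectedShuffleIds() {
        Set<Integer> affected = new HashSet<>(failedSkewShuffleIds);
        Deque<Integer> pending = new ArrayDeque<>(failedSkewShuffleIds);
        while (!pending.isEmpty()) {
            int current = pending.pop();
            for (int child : downstream.getOrDefault(current, Collections.emptyList())) {
                if (affected.add(child)) {
                    pending.push(child);
                }
            }
        }
        return affected;
    }

    // true: request a new shuffle id; false: reuse the previous one.
    boolean needsNewShuffleId(int shuffleId) {
        return affectedShuffleIds().contains(shuffleId);
    }
}
```

With dependencies 1 -> 2 -> 3 and 4 -> 5 and a fetch failure on skew shuffle 1, this sketch asks for new ids for shuffles 1, 2, and 3, and reuses the ids of shuffles 4 and 5.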
   
   @waitinfuture @RexXiong @mridulm @cfmcgrady @s0nskar @pan3793, please take 
some time to review this PR. Thanks, all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
