yl09099 opened a new pull request, #1138:
URL: https://github.com/apache/incubator-uniffle/pull/1138

   ### What changes were proposed in this pull request?
   
   When a block fails to be sent, the ShuffleServer it was being sent to needs 
to be reported.
   
   Ⅰ. Overall objective:
   
   1. During the shuffle write phase, faulty ShuffleServer nodes are reported 
and the ShuffleServer list is reallocated;
   2. A Spark stage-level retry is triggered. The faulty ShuffleServer nodes 
are excluded and the list is reallocated before the retry.
   
   Ⅱ. Implementation logic diagram:
   
   
![image](https://github.com/apache/incubator-uniffle/assets/33595968/866c8292-e0ff-4532-b519-02f424f4c2fc)
   
   Ⅲ. As shown in the picture above:
   
   1. During shuffle registration, follow the solid blue line: obtain the list 
of ShuffleServers to write to through the RPC interface of the Coordinator 
Client. The list is bound to the ShuffleID.
   2. When a task of the stage starts, follow the solid green line: obtain the 
list of ShuffleServers to write to from shuffleIdToShuffleHandleInfo through 
the RPC interface of the ShuffleManager Client.
   3. Within the stage, if a task fails to write blocks to a ShuffleServer, 
follow the red line: report that ShuffleServer to the FailedShuffleServerList 
in RSSShuffleManager through the RPC interface.
   4. FailedShuffleServerList records the number of ShuffleServer failures. 
Once the number of failures reaches the task-level maximum number of retries, 
follow the dotted orange line: obtain a new list of ShuffleServers to write to 
through the RPC interface of the Coordinator Client (ShuffleServers that failed 
to be written to are excluded). After obtaining the list, go to step 5 of the 
dotted orange line: throwing a FetchFailed exception triggers a Spark 
stage-level retry.
   5. Spark generates attempt 1 of the stage. The new attempt pulls the 
corresponding ShuffleServer list by following the dotted green line.
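
The failure counting and exclusion described in steps 3 and 4 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the class `FailedServerTracker`, its methods `reportFailure` and `reassign`, and the per-server counting are hypothetical, and the RPC calls to the Coordinator and the FetchFailed exception are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not a class from the Uniffle code base.
class FailedServerTracker {
    // Assumed to mirror the task-level maximum number of retries.
    private final int maxFailures;
    // Failure count per ShuffleServer id (per-server counting is an assumption).
    private final Map<String, Integer> failureCounts = new HashMap<>();

    FailedServerTracker(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    // Step 3: record one failed write for the given server.
    // Returns true once that server has reached the failure limit,
    // which is the point where step 4 would start the reassignment.
    synchronized boolean reportFailure(String serverId) {
        int count = failureCounts.merge(serverId, 1, Integer::sum);
        return count >= maxFailures;
    }

    // Step 4: keep only the candidates that have not reached the limit,
    // preserving the order of the candidate list.
    synchronized List<String> reassign(List<String> candidates) {
        List<String> healthy = new ArrayList<>();
        for (String serverId : candidates) {
            if (failureCounts.getOrDefault(serverId, 0) < maxFailures) {
                healthy.add(serverId);
            }
        }
        return healthy;
    }
}
```

With a limit of 2, two calls to `reportFailure("server-a")` make the second call return `true`, and a subsequent `reassign` over `server-a` and `server-b` keeps only `server-b`. In the flow above, that `true` result would correspond to requesting a new list from the Coordinator and triggering the stage retry.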
   
   ### Why are the changes needed?
   
   To report the ShuffleServer corresponding to a block that failed to be sent.
   
   Fix: #825 
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

