yl09099 opened a new pull request, #1138: URL: https://github.com/apache/incubator-uniffle/pull/1138
### What changes were proposed in this pull request?

Report the ShuffleServer that corresponds to a block that failed to be sent.

Ⅰ. Overall objective:

1. During the shuffle write phase, faulty ShuffleServer nodes are reported and the ShuffleServer list is reallocated.
2. A Spark stage-level retry is triggered. The faulty ShuffleServer nodes are excluded and the list is reallocated before the retry.

Ⅱ. Implementation logic diagram:

Ⅲ. As shown in the diagram above:

1. During shuffle registration, follow the solid blue line: obtain the list of ShuffleServers to write to through the RPC interface of the Coordinator Client. The list is bound to the ShuffleID.
2. When a Task of the Stage starts, follow the solid green line: obtain the list of ShuffleServers to write to from shuffleIdToShuffleHandleInfo through the RPC interface of the ShuffleManager Client.
3. Within the Stage, if a Task fails to write blocks to a ShuffleServer, follow the red line: report that ShuffleServer to FailedShuffleServerList in RSSShuffleManager through the RPC interface.
4. FailedShuffleServerList records the number of failures for each ShuffleServer. Once the number of failures reaches the maximum number of Task-level retries, follow the dotted orange line: through the RPC interface of the Coordinator Client, obtain a new list of ShuffleServers to write to (the ShuffleServers that failed to be written to are excluded). After obtaining the list, go to step 5 of the dotted orange line. Throwing a FetchFailed exception triggers a Spark stage-level retry.
5. Spark generates attempt 1 of the Stage. The new attempt pulls the corresponding ShuffleServer list by following the dotted green line.

### Why are the changes needed?

Reports the ShuffleServer corresponding to the block that failed to be sent.

Fix: #825

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
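
The failure-reporting flow in steps 3 and 4 of the pull request description can be illustrated with a minimal Java sketch. This is not the code from the pull request: the class `FailedServerTracker`, its methods `reportFailure` and `reassign`, and the server ids are all hypothetical names chosen for illustration. The sketch assumes that a server is excluded once its failure count reaches a configured limit, and it stands in for the Coordinator call with a simple local filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the bookkeeping described in steps 3 and 4.
// It is NOT the Uniffle implementation; every name here is an assumption.
class FailedServerTracker {
    // Assumed to mirror the maximum number of Task-level retries.
    private final int maxTaskFailures;
    // Number of reported write failures per ShuffleServer id.
    private final Map<String, Integer> failureCounts = new ConcurrentHashMap<>();
    // Servers whose failure count has reached the limit.
    private final Set<String> excluded = ConcurrentHashMap.newKeySet();

    FailedServerTracker(int maxTaskFailures) {
        if (maxTaskFailures < 1) {
            throw new IllegalArgumentException("maxTaskFailures must be >= 1");
        }
        this.maxTaskFailures = maxTaskFailures;
    }

    // Records one failed write to the given server (step 3).
    // Returns true once the server has reached the failure limit (step 4).
    boolean reportFailure(String serverId) {
        int count = failureCounts.merge(serverId, 1, Integer::sum);
        if (count >= maxTaskFailures) {
            excluded.add(serverId);
            return true;
        }
        return false;
    }

    // Stand-in for asking the Coordinator for a new list:
    // returns the candidates minus the excluded servers, in the same order.
    List<String> reassign(List<String> candidates) {
        List<String> healthy = new ArrayList<>();
        for (String serverId : candidates) {
            if (!excluded.contains(serverId)) {
                healthy.add(serverId);
            }
        }
        return healthy;
    }

    public static void main(String[] args) {
        FailedServerTracker tracker = new FailedServerTracker(2);
        List<String> servers = Arrays.asList("server-a", "server-b", "server-c");

        tracker.reportFailure("server-b");                        // first failure
        boolean limitReached = tracker.reportFailure("server-b"); // second failure

        if (limitReached) {
            // In the described flow, a FetchFailed exception would be thrown
            // at this point to trigger the stage-level retry.
            System.out.println("Reassigned: " + tracker.reassign(servers));
            // prints "Reassigned: [server-a, server-c]"
        }
    }
}
```

Keeping the counts and the exclusion set in one place means a retried stage attempt can ask for its server list without knowing which tasks reported the failures.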
