advancedxy opened a new issue, #477:
URL: https://github.com/apache/incubator-uniffle/issues/477

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the 
[issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Describe the feature
   
   When a read of `Uniffle` shuffle data fails, the Spark client should have the chance to recompute the whole stage instead of failing the application.
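
   Spark already re-runs a parent stage when a task fails with a `FetchFailedException`, so one possible direction (not the final design, which will come with the design doc) is for the Uniffle shuffle reader to surface an unrecoverable read failure as that exception rather than a plain task failure. The sketch below is illustrative only: it assumes Spark 3.x signatures, sits in the `org.apache.spark.shuffle` package because `FetchFailedException` is `private[spark]`, and all helper names are hypothetical.

   ```scala
   // Illustrative sketch, not Uniffle code: report a Uniffle read failure as a
   // FetchFailedException so Spark's DAGScheduler marks the map outputs as lost
   // and re-submits (recomputes) the parent stage.
   package org.apache.spark.shuffle

   import org.apache.spark.storage.BlockManagerId

   object StageRecomputeSketch {

     def readOrTriggerRecompute[T](
         shuffleId: Int,
         mapId: Long,
         mapIndex: Int,
         reduceId: Int)(readFromShuffleServer: => Iterator[T]): Iterator[T] = {
       try {
         readFromShuffleServer // hypothetical Uniffle read path
       } catch {
         case e: Exception =>
           // Uniffle shuffle servers are not Spark block managers, so this
           // address is a placeholder purely for illustration.
           val address = BlockManagerId("uniffle-shuffle-server", "unknown-host", 19999)
           throw new FetchFailedException(
             address, shuffleId, mapId, mapIndex, reduceId,
             s"Uniffle shuffle read failed, triggering stage recompute: ${e.getMessage}", e)
       }
     }
   }
   ```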
   
   ### Motivation
   
   In a distributed cluster with a large number of shuffle servers, it is common for some nodes to go down, for example:
   1. node crashes or maintenance due to hardware failures or security patches
   2. pod eviction when deployed in a Kubernetes environment
   3. VM/spot instance eviction when deployed in a cloud environment
   
   Uniffle already provides a mechanism to overcome this: the quorum protocol. However, it requires multiple replicas of the same shuffle data, which increases network traffic and memory pressure on the shuffle servers, and end-to-end performance may degrade due to the replication.
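
   For context, enabling the quorum protocol on the Spark client looks roughly like the following (a sketch; the exact keys and recommended values are in the client guide):

   ```
   # Write each block to 3 shuffle servers; 2 successful writes / 2 readable
   # replicas are enough, so a single failed server can be tolerated.
   spark.rss.data.replica 3
   spark.rss.data.replica.write 2
   spark.rss.data.replica.read 2
   ```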
   
   I'd like to provide a new way to tolerate such (rare) node failures. If the whole stage can be recomputed, the Spark application becomes resilient to shuffle server node failures without paying the replication cost.
   
   ### Describe the solution
   
   TBD.
   
   A design doc will be added later.
   
   ### Additional context
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!

