leixm opened a new pull request, #3490:
URL: https://github.com/apache/celeborn/pull/3490

   ### What changes were proposed in this pull request?
   In the dual-replica scenario, the reader should select which copy to read 
based on taskAttemptId when it is created. Usually, taskAttempt0 selects the 
primary partitionLocation, taskAttempt1 selects the replica partitionLocation, 
and so on, alternating with each attempt. This provides better fault tolerance.
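
   As a rough illustration of the selection rule above, here is a minimal, self-contained sketch. It is not the code in this PR: `Location`, `peer`, and `selectForAttempt` are hypothetical names used only for this example, and it assumes the attempt number is a zero-based counter where odd attempts read the replica.

```java
final class ReplicaSelectionSketch {

    /** Hypothetical stand-in for a partition location; peer is null when there is no replica. */
    static final class Location {
        final String name;
        final Location peer;

        Location(String name, Location peer) {
            this.name = name;
            this.peer = peer;
        }
    }

    /**
     * Picks the location a reader should start from.
     * Even attempts (0, 2, ...) read the primary; odd attempts (1, 3, ...)
     * read the replica when one exists, otherwise they fall back to the primary.
     */
    static Location selectForAttempt(Location primary, int attemptNumber) {
        boolean oddAttempt = attemptNumber % 2 == 1;
        if (oddAttempt && primary.peer != null) {
            return primary.peer;
        }
        return primary;
    }

    public static void main(String[] args) {
        Location replica = new Location("replica", null);
        Location primary = new Location("primary", replica);
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.println("attempt " + attempt + " reads "
                + selectForAttempt(primary, attempt).name);
        }
    }
}
```

   With this rule, a retry of a task whose first attempt failed on corrupted primary data starts from the other copy instead of reading the same data again.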
   
   
   ### Why are the changes needed?
   https://github.com/apache/celeborn/pull/3079 removed the logic that read 
replica data when the task attempt is odd. As a result, if the data of the 
primary partitionLocation is corrupted and CelebornInputStream#fillBuffer 
throws an exception, such as a decompression failure, the replica 
partitionLocation is not used when the task is retried. If taskAttempt1 read 
from the replica partitionLocation instead, it could run successfully.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Existing UTs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

