Re: [PR] [CELEBORN-1439] Fix revive logic bug which will casue data correctness issue and job failiure [celeborn]

via GitHub Wed, 03 Jul 2024 00:38:39 -0700


wangshengjie123 commented on PR #2532:
URL: https://github.com/apache/celeborn/pull/2532#issuecomment-2205304132


   > Hi @wangshengjie123, thanks for fixing this! for referrence, I tested a 
different fine grined version 
([here](https://github.com/YutingWang98/celeborn/pull/1)) without the mentioned 
in this 
[comment](https://github.com/apache/celeborn/pull/2532#discussion_r1623738476). 
However, we are still seeing data loss, and the loss can't be reproduced 
everytime. Wondering if there is still any concurrency issues you can see in 
the forked PR 
[YutingWang98#1](https://github.com/YutingWang98/celeborn/pull/1)? thanks in 
advance!
   
   I am sure there is still hit concurrency issues. 
   - firstly, agree with dont put empty set to `    val requests = 
changePartitionRequests.computeIfAbsent(shuffleId, rpcContextRegisterFunc)`, so 
dont need to deduplicate revive request with same epoch and check in 
`handleRequestPartitions`
   - Concurrency issues are due to lock granularity. Just one question, why 
data lost, i think same revive request be handled. 
   There is still an concurrency issue for `requestSet` in [this 
condition](https://github.com/YutingWang98/celeborn/blob/020c3b4b36614551fdb165c93024d9bba5a54c76/client/src/main/scala/org/apache/celeborn/client/ChangePartitionManager.scala#L92C1-L93C1),
 partitionId may be removed and same revive request will be handled more than 
one time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [CELEBORN-1439] Fix revive logic bug which will casue data correctness issue and job failiure [celeborn]

Reply via email to