RexXiong commented on PR #2532:
URL: https://github.com/apache/celeborn/pull/2532#issuecomment-2229898570

   > > Hi @wangshengjie123, thanks for fixing this! for referrence, I tested a 
different fine grined version 
([here](https://github.com/YutingWang98/celeborn/pull/1)) without the mentioned 
in this 
[comment](https://github.com/apache/celeborn/pull/2532#discussion_r1623738476). 
However, we are still seeing data loss, and the loss can't be reproduced 
everytime. Wondering if there is still any concurrency issues you can see in 
the forked PR 
[YutingWang98#1](https://github.com/YutingWang98/celeborn/pull/1)? thanks in 
advance!
   > 
   > I am sure there is still hit concurrency issues.
   > 
   > * firstly, agree with dont put empty set to `    val requests = 
changePartitionRequests.computeIfAbsent(shuffleId, rpcContextRegisterFunc)`, so 
dont need to deduplicate revive request with same epoch and check in 
`handleRequestPartitions`
   > * Concurrency issues are due to lock granularity. Just one question, why 
data lost, i think same revive request be handled.
   >   There is still an concurrency issue for `requestSet` in [this 
condition](https://github.com/YutingWang98/celeborn/blob/020c3b4b36614551fdb165c93024d9bba5a54c76/client/src/main/scala/org/apache/celeborn/client/ChangePartitionManager.scala#L92C1-L93C1),
 partitionId may be removed and same revive request will be handled more than 
one time.
   
   May be data lost due to `https://github.com/apache/celeborn/pull/2621`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to