wangshengjie123 commented on PR #2532: URL: https://github.com/apache/celeborn/pull/2532#issuecomment-2205304132
> Hi @wangshengjie123, thanks for fixing this! for referrence, I tested a different fine grined version ([here](https://github.com/YutingWang98/celeborn/pull/1)) without the mentioned in this [comment](https://github.com/apache/celeborn/pull/2532#discussion_r1623738476). However, we are still seeing data loss, and the loss can't be reproduced everytime. Wondering if there is still any concurrency issues you can see in the forked PR [YutingWang98#1](https://github.com/YutingWang98/celeborn/pull/1)? thanks in advance! I am sure there is still hit concurrency issues. - firstly, agree with dont put empty set to ` val requests = changePartitionRequests.computeIfAbsent(shuffleId, rpcContextRegisterFunc)`, so dont need to deduplicate revive request with same epoch and check in `handleRequestPartitions` - Concurrency issues are due to lock granularity. Just one question, why data lost, i think same revive request be handled. There is still an concurrency issue for `requestSet` in [this condition](https://github.com/YutingWang98/celeborn/blob/020c3b4b36614551fdb165c93024d9bba5a54c76/client/src/main/scala/org/apache/celeborn/client/ChangePartitionManager.scala#L92C1-L93C1), partitionId may be removed and same revive request will be handled more than one time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
