swuferhong commented on issue #2405:
URL: https://github.com/apache/fluss/issues/2405#issuecomment-3817218540

   > Managed to reproduce this. It turns out to be a race condition that triggers under high load (verified with 50 tables / 400 buckets).
   > 
   > The root cause is that `CoordinatorEventProcessor` uses a "fire-and-forget" approach for `StopReplica` requests during rebalance. It marks the task as `COMPLETED` immediately after sending the RPC. If the `TabletServer` is backed up, it may not yet have processed the stop request or resigned leadership, even though the client already sees the task as finished.
   > 
   > I've got a fix working that:
   > 
   > * Updates `RebalanceManager` to explicitly track pending deletions.
   > * Changes `CoordinatorEventProcessor` to wait for the `DeleteReplicaResponseReceivedEvent` callback before marking the task as completed.
   > * Adds logic to handle `DeadTabletServerEvent` so the rebalance task doesn't hang if a server dies while waiting for the response.
   > 
   > Verified the fix passes the reproduction test.
   > 
   > [@swuferhong](https://github.com/swuferhong) could you please assign this issue to me? I will provide the PR shortly.
   
   Done!
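
For readers following the thread, the approach described in the quoted comment can be sketched roughly as below. This is only an illustration, not the code from the eventual PR: the class `PendingDeletionTracker`, its method names, and the string task ids are all hypothetical, and treating a dead server as "no longer awaited" (rather than failing the task) is an assumption. The idea is that a task stays `PENDING` after the stop request is sent and only becomes `COMPLETED` once every server has either responded or been declared dead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not the actual Fluss RebalanceManager. */
class PendingDeletionTracker {
    enum Status { PENDING, COMPLETED }

    // taskId -> ids of tablet servers that have not yet acknowledged the request
    private final Map<String, Set<Integer>> pendingAcks = new HashMap<>();
    private final Map<String, Status> statuses = new HashMap<>();

    /** Called right after the stop requests are sent; the task is NOT completed yet. */
    void onRequestsSent(String taskId, Set<Integer> serverIds) {
        if (serverIds.isEmpty()) {
            statuses.put(taskId, Status.COMPLETED); // nothing to wait for
            return;
        }
        pendingAcks.put(taskId, new HashSet<>(serverIds));
        statuses.put(taskId, Status.PENDING);
    }

    /** Called when a server's delete-replica response arrives. */
    void onDeleteResponse(String taskId, int serverId) {
        acknowledge(taskId, serverId);
    }

    /** Called when a server is detected dead: stop waiting on it so no task hangs. */
    void onServerDead(int serverId) {
        // Iterate over a copy of the keys because acknowledge() may remove entries.
        for (String taskId : new ArrayList<>(pendingAcks.keySet())) {
            acknowledge(taskId, serverId);
        }
    }

    private void acknowledge(String taskId, int serverId) {
        Set<Integer> waiting = pendingAcks.get(taskId);
        if (waiting == null) {
            return; // unknown task, or already completed
        }
        waiting.remove(serverId);
        if (waiting.isEmpty()) {
            pendingAcks.remove(taskId);
            statuses.put(taskId, Status.COMPLETED);
        }
    }

    /** Returns null for a task that was never registered. */
    Status statusOf(String taskId) {
        return statuses.get(taskId);
    }
}
```

With this shape, a client polling `statusOf` would not see `COMPLETED` while a backed-up server still holds the replica, which is the window the race condition exploited.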


