ZuebeyirEser commented on issue #2405:
URL: https://github.com/apache/fluss/issues/2405#issuecomment-3816813002

   Managed to reproduce this. It turns out to be a race condition that triggers 
under high load (verified with 50 tables / 400 buckets).
   
   The root cause is that `CoordinatorEventProcessor` takes a "fire-and-forget" 
approach to `StopReplica` requests during rebalance: it marks the task as 
`COMPLETED` immediately after sending the RPC. If the `TabletServer` is backed 
up, it may not yet have processed the stop request or resigned leadership, 
even though the client already sees the task as finished.
   
   I've got a fix working that:
   - Updates `RebalanceManager` to explicitly track pending deletions.
   - Changes `CoordinatorEventProcessor` to wait for the 
`DeleteReplicaResponseReceivedEvent` callback before marking the task as 
completed.
   - Adds logic to handle `DeadTabletServerEvent` so the rebalance task doesn't 
hang if a server dies while waiting for the response.
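
   To illustrate the idea behind the three changes above, here is a minimal, 
self-contained sketch. It is not the actual patch: `PendingDeletionTracker` 
and all of its method names are hypothetical, and treating a dead server as 
"no longer waited on" is an assumption about how the dead-server case could 
be resolved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch: a rebalance task counts as completed only after every
 * server that was sent a stop/delete request has acknowledged it (or died).
 */
public class PendingDeletionTracker {
    // taskId -> ids of servers whose acknowledgement is still outstanding
    private final Map<String, Set<Integer>> pending = new HashMap<>();
    private final Set<String> completed = new HashSet<>();

    /** Called right after the requests are sent; the task is NOT completed here. */
    public synchronized void onStopReplicaSent(String taskId, Set<Integer> serverIds) {
        if (serverIds.isEmpty()) {
            completed.add(taskId); // nothing to wait for
            return;
        }
        pending.put(taskId, new HashSet<>(serverIds));
    }

    /** Called from the (assumed) response callback for one server. */
    public synchronized void onDeleteReplicaResponse(String taskId, int serverId) {
        release(taskId, serverId);
    }

    /** Called when a server is declared dead: stop waiting for it in every task. */
    public synchronized void onServerDead(int serverId) {
        // Copy the keys because release() may remove entries from the map.
        for (String taskId : new ArrayList<>(pending.keySet())) {
            release(taskId, serverId);
        }
    }

    public synchronized boolean isCompleted(String taskId) {
        return completed.contains(taskId);
    }

    private void release(String taskId, int serverId) {
        Set<Integer> waiting = pending.get(taskId);
        if (waiting == null) {
            return; // unknown or already finished task
        }
        waiting.remove(serverId);
        if (waiting.isEmpty()) {
            pending.remove(taskId);
            completed.add(taskId);
        }
    }
}
```

   With this shape, sending the RPC only registers the wait; completion is 
driven by the response callback, and a dead server cannot leave the task 
hanging.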
   
   Verified the fix passes the reproduction test. 
   
   @swuferhong could you please assign this issue to me? I will provide the PR 
shortly.

