swuferhong commented on issue #2405: URL: https://github.com/apache/fluss/issues/2405#issuecomment-3817218540
> Managed to reproduce this. It turns out to be a race condition that triggers under high load (verified with 50 tables / 400 buckets).
>
> The root cause is that `CoordinatorEventProcessor` uses a "fire-and-forget" approach for `StopReplica` requests during rebalance. It marks the task as `COMPLETED` immediately after sending the RPC. If the `TabletServer` is backed up, it hasn't actually processed the stop request or resigned leadership yet, even though the client sees the task as finished.
>
> I've got a fix working that:
>
> * Updates `RebalanceManager` to explicitly track pending deletions.
> * Changes `CoordinatorEventProcessor` to wait for the `DeleteReplicaResponseReceivedEvent` callback before marking the task as completed.
> * Adds logic to handle `DeadTabletServerEvent` so the rebalance task doesn't hang if a server dies while waiting for the response.
>
> Verified the fix passes the reproduction test.
>
> [@swuferhong](https://github.com/swuferhong) could you please assign this issue to me? I will provide the PR shortly.

Done!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
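
For readers following the thread, the approach outlined in the quoted comment (track pending stop requests, complete the task only once every response has arrived, and release the task if a server dies) could look roughly like the sketch below. This is not the Fluss implementation: `PendingStopTracker`, its method names, and the `TaskStatus` values other than `COMPLETED` are hypothetical, and marking a task `FAILED` when a server dies is an assumption, since the comment only says the task must not hang.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Fluss.
final class PendingStopTracker {
    public enum TaskStatus { IN_PROGRESS, COMPLETED, FAILED }

    // taskId -> ids of servers that have not yet acknowledged the stop request
    private final Map<Integer, Set<Integer>> pendingAcks = new HashMap<>();
    private final Map<Integer, TaskStatus> status = new HashMap<>();

    // Called right after the stop requests are sent. Unlike fire-and-forget,
    // the task stays IN_PROGRESS until every server has responded.
    public void onStopRequestsSent(int taskId, Set<Integer> serverIds) {
        if (serverIds.isEmpty()) {
            status.put(taskId, TaskStatus.COMPLETED);
            return;
        }
        pendingAcks.put(taskId, new HashSet<>(serverIds));
        status.put(taskId, TaskStatus.IN_PROGRESS);
    }

    // Called when a server's response arrives; the last response completes the task.
    public void onStopResponse(int taskId, int serverId) {
        Set<Integer> waiting = pendingAcks.get(taskId);
        if (waiting == null) {
            return; // unknown or already finished task
        }
        waiting.remove(serverId);
        if (waiting.isEmpty()) {
            pendingAcks.remove(taskId);
            status.put(taskId, TaskStatus.COMPLETED);
        }
    }

    // Called when a server is detected as dead. Every task still waiting on
    // that server is released (assumption: marked FAILED) so it cannot hang.
    public void onServerDead(int serverId) {
        Iterator<Map.Entry<Integer, Set<Integer>>> it = pendingAcks.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Set<Integer>> entry = it.next();
            if (entry.getValue().contains(serverId)) {
                int taskId = entry.getKey(); // read the key before removing the entry
                it.remove();
                status.put(taskId, TaskStatus.FAILED);
            }
        }
    }

    public TaskStatus statusOf(int taskId) {
        return status.get(taskId);
    }
}
```

The point of the sketch is that completion is driven by the response events rather than by the act of sending the request, so a backed-up server delays the `COMPLETED` status instead of being masked by it.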
