[I] [Server] AdjustIsrManager RPC-level exception handling causes unsentMap leak [fluss]

via GitHub Thu, 11 Dec 2025 23:05:33 -0800


platinumhamburg opened a new issue, #2160:
URL: https://github.com/apache/fluss/issues/2160


   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/fluss/issues) and 
found nothing similar.
   
   
   ### Fluss version
   
   0.8.0 (latest release)
   
   ### Please describe the bug 🐞
   
   ## Description
   `AdjustIsrManager` is responsible for sending ISR (In-Sync Replica) 
adjustment requests to the coordinator. The current implementation handles only 
bucket-level errors reported during normal request/response processing. 
However, **top-level RPC exceptions** (such as `NetworkException`, connection 
failures, or serialization errors) are either:
   
   1. **Silently swallowed** and deferred for retry without proper cleanup, or
   2. **Not caught at all**, leaving the internal state inconsistent.
   
   ### Impact
   When these top-level exceptions occur and are not properly handled, the 
affected `TableBucket` entries in the internal `unsentMap` are **never 
removed**. This leads to:
   
   - **Permanent blockage**: All subsequent `submit(...)` calls for the same 
`TableBucket` are immediately rejected (typically throwing 
`OperationNotAttemptedException` or similar), because the manager believes a 
previous request is still in-flight.
   - **ISR adjustment stalls**: Replicas cannot be added or removed from the 
ISR, preventing the cluster from recovering from failures or scaling events.
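
The blocking behavior can be illustrated with a deliberately simplified model. This is a sketch, not the actual Fluss code: `LeakyAdjustIsrManager`, the `String` bucket key, the `Function`-based RPC stand-in, and the use of `IllegalStateException` for the rejection are assumptions made for illustration. Only `unsentMap` and `submit(...)` are names taken from this report.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Simplified, hypothetical stand-in for the manager; NOT the real Fluss class. */
class LeakyAdjustIsrManager {
    // Hypothetical model: bucket id -> future of the pending request.
    final Map<String, CompletableFuture<String>> unsentMap = new ConcurrentHashMap<>();
    // Hypothetical stand-in for the coordinator RPC.
    private final Function<String, CompletableFuture<String>> rpc;

    LeakyAdjustIsrManager(Function<String, CompletableFuture<String>> rpc) {
        this.rpc = rpc;
    }

    CompletableFuture<String> submit(String bucket) {
        CompletableFuture<String> result = new CompletableFuture<>();
        // Reject the call if an entry for this bucket is still pending.
        if (unsentMap.putIfAbsent(bucket, result) != null) {
            result.completeExceptionally(
                    new IllegalStateException("pending request for " + bucket));
            return result;
        }
        rpc.apply(bucket).whenComplete((resp, err) -> {
            if (err == null) {
                unsentMap.remove(bucket);
                result.complete(resp);
            }
            // Leak being illustrated: on an RPC-level failure the entry is
            // neither removed nor is the caller's future completed.
        });
        return result;
    }
}
```

With an RPC stand-in that always fails, the first `submit("b0")` never completes and every later `submit("b0")` is rejected, because the entry stays in `unsentMap`.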
   
   ### Solution
   
   1. **Wrap RPC calls with exception handling**: Ensure all top-level 
exceptions (network, serialization, etc.) are caught during the 
`sendAdjustIsrRequest` flow.
   2. **Clean up unsentMap on failure**: On catching a top-level exception, 
remove the corresponding `TableBucket` entry from `unsentMap` to allow 
subsequent retries.
   3. **Propagate exception to caller**: Complete the returned 
`CompletableFuture` exceptionally with the caught exception, so the caller 
(e.g., `ReplicaManager`) is aware of the failure and can decide on retry logic.
   4. **Add test coverage**: Introduce a test case that simulates RPC-level 
failures (e.g., by making `TestCoordinatorGateway` simulate a network failure) 
and verifies:
      - The exception is propagated to the future.
      - A subsequent `submit(...)` for the same `TableBucket` is not blocked 
after the network issue is resolved.
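
Steps 1 to 3 might look roughly like the following sketch. It is a simplified model under stated assumptions, not the actual Fluss implementation: `SafeAdjustIsrManager`, the `String` bucket key, the `Function`-based RPC stand-in, and `IllegalStateException` as the rejection type are hypothetical. Only `unsentMap` and `submit(...)` are names used in this report.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Simplified, hypothetical model of the proposed fix; NOT the real Fluss class. */
class SafeAdjustIsrManager {
    // Hypothetical model: bucket id -> future of the pending request.
    final Map<String, CompletableFuture<String>> unsentMap = new ConcurrentHashMap<>();
    // Hypothetical stand-in for the coordinator RPC.
    private final Function<String, CompletableFuture<String>> rpc;

    SafeAdjustIsrManager(Function<String, CompletableFuture<String>> rpc) {
        this.rpc = rpc;
    }

    CompletableFuture<String> submit(String bucket) {
        CompletableFuture<String> result = new CompletableFuture<>();
        if (unsentMap.putIfAbsent(bucket, result) != null) {
            result.completeExceptionally(
                    new IllegalStateException("pending request for " + bucket));
            return result;
        }
        CompletableFuture<String> response;
        try {
            // The call itself may throw synchronously (e.g. while serializing).
            response = rpc.apply(bucket);
        } catch (RuntimeException e) {
            unsentMap.remove(bucket, result); // step 2: clean up
            result.completeExceptionally(e);  // step 3: propagate
            return result;
        }
        response.whenComplete((resp, err) -> {
            // Remove the entry on success and on failure alike.
            unsentMap.remove(bucket, result);
            if (err != null) {
                result.completeExceptionally(err);
            } else {
                result.complete(resp);
            }
        });
        return result;
    }
}
```

A test along the lines of step 4 could first submit while the RPC stand-in fails, check that the returned future completed exceptionally and that `unsentMap` is empty, then switch the stand-in to succeed and check that a second `submit(...)` for the same bucket goes through.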
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

