platinumhamburg opened a new issue, #2160: URL: https://github.com/apache/fluss/issues/2160
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/fluss/issues) and found nothing similar.

### Fluss version

0.8.0 (latest release)

### Please describe the bug 🐞

## Description

`AdjustIsrManager` is responsible for sending ISR (In-Sync Replica) adjustment requests to the coordinator. The current implementation only handles bucket-level graceful exceptions within the request/response processing logic. However, **top-level RPC exceptions** (such as `NetworkException`, connection failures, or serialization errors) are either:

1. **Silently swallowed** and deferred for retry without proper cleanup, or
2. **Not caught at all**, leaving the internal state inconsistent.

### Impact

When these top-level exceptions occur and are not properly handled, the affected `TableBucket` entries in the internal `unsentMap` are **never removed**. This leads to:

- **Permanent blockage**: All subsequent `submit(...)` calls for the same `TableBucket` are immediately rejected (typically throwing `OperationNotAttemptedException` or similar), because the manager believes a previous request is still in-flight.
- **ISR adjustment stalls**: Replicas cannot be added or removed from the ISR, preventing the cluster from recovering from failures or scaling events.

### Solution

1. **Wrap RPC calls with exception handling**: Ensure all top-level exceptions (network, serialization, etc.) are caught during the `sendAdjustIsrRequest` flow.
2. **Clean up unsentMap on failure**: On catching a top-level exception, remove the corresponding `TableBucket` entry from `unsentMap` to allow subsequent retries.
3. **Propagate exception to caller**: Complete the returned `CompletableFuture` exceptionally with the caught exception, so the caller (e.g., `ReplicaManager`) is aware of the failure and can decide on retry logic.
4. **Add test coverage**: Introduce a test case that simulates RPC-level failures (e.g., via `TestCoordinatorGateway` network issues) and verifies:
   - The exception is propagated to the future.
   - A subsequent `submit(...)` is not blocked after the network issue is resolved.

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
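### Illustrative sketch

The proposed fix (catch top-level failures, clean up `unsentMap`, propagate the exception) could look roughly like the sketch below. This is not the Fluss implementation: `AdjustIsrSketch`, `hasPending`, the `String` bucket key standing in for `TableBucket`, the `Function` standing in for the coordinator gateway, and the `IllegalStateException` used for rejected duplicates are all hypothetical simplifications chosen to keep the example self-contained.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical, simplified stand-in for AdjustIsrManager (not the Fluss code). */
class AdjustIsrSketch {
    // Pending requests keyed by bucket; a String stands in for TableBucket.
    private final Map<String, CompletableFuture<String>> unsentMap = new ConcurrentHashMap<>();
    // Stand-in for the coordinator RPC; it may throw or return a failed future.
    private final Function<String, CompletableFuture<String>> rpc;

    AdjustIsrSketch(Function<String, CompletableFuture<String>> rpc) {
        this.rpc = rpc;
    }

    CompletableFuture<String> submit(String bucket) {
        CompletableFuture<String> result = new CompletableFuture<>();
        // Reject a second request while one is still pending for this bucket.
        if (unsentMap.putIfAbsent(bucket, result) != null) {
            result.completeExceptionally(
                    new IllegalStateException("request already pending for " + bucket));
            return result;
        }
        CompletableFuture<String> response;
        try {
            response = rpc.apply(bucket);
        } catch (Exception e) {
            // The RPC threw synchronously: turn it into a failed future.
            response = new CompletableFuture<>();
            response.completeExceptionally(e);
        }
        response.whenComplete((value, error) -> {
            // Always remove the entry, so a failure cannot block later submits.
            unsentMap.remove(bucket, result);
            if (error != null) {
                result.completeExceptionally(error); // propagate to the caller
            } else {
                result.complete(value);
            }
        });
        return result;
    }

    boolean hasPending(String bucket) {
        return unsentMap.containsKey(bucket);
    }

    public static void main(String[] args) {
        boolean[] networkDown = {true};
        AdjustIsrSketch manager = new AdjustIsrSketch(bucket -> {
            if (networkDown[0]) {
                throw new IllegalStateException("simulated network failure");
            }
            return CompletableFuture.completedFuture("adjusted " + bucket);
        });
        CompletableFuture<String> first = manager.submit("bucket-0");
        System.out.println("first failed: " + first.isCompletedExceptionally());
        networkDown[0] = false; // the network issue is resolved
        System.out.println("retry result: " + manager.submit("bucket-0").join());
    }
}
```

The key design choice in this sketch is that cleanup happens in a single `whenComplete` callback that runs on both success and failure, so synchronous throws and asynchronously failed futures follow the same path and the entry cannot be left behind.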
