[I] ServerNode info of all tabletServers may be invalid in client metadata when cluster upgrading [fluss]

via GitHub Wed, 03 Dec 2025 18:13:40 -0800


swuferhong opened a new issue, #2097:
URL: https://github.com/apache/fluss/issues/2097


   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/fluss/issues) and 
found nothing similar.
   
   
   ### Fluss version
   
   0.8.0 (latest release)
   
   ### Please describe the bug 🐞
   
   In one situation, the ServerNode info of all TabletServers in the client 
metadata may become invalid during a cluster upgrade, which causes 
`write/read/lookup` operations to block forever. Imagine this case:
   
   There is a `Fluss` cluster with 3 `TabletServers` and a Fluss write job 
running with high parallelism. The cluster undergoes a rolling upgrade under 
the following conditions:
   
   1. Pods are not upgraded in place; their IP addresses change after restart.
   2. Fluss networking has no built-in request timeout and relies solely on 
the Netty client’s connection timeout. If a request is sent to a disconnected 
server, the client blocks waiting for the server’s response (sync 
acknowledgment) until the Netty timeout of 120 seconds is reached.
   
   Below is the failure scenario:
   
   1) **Initial state**:  
     ts-0 → 192.108.0.1  
     ts-1 → 192.108.0.2  
     ts-2 → 192.108.0.3  
   
   2) **Upgrade starts**: ts-0 becomes unreachable. The client attempts to 
`updateMetadata` by sending the request to ts-0, fails to connect, and waits 
120 seconds.
   
   3) **ts-0 restarts** with a new IP: 192.108.0.4.
   
   4) The client retries `updateMetadata`, again targeting ts-0 (based on stale 
metadata), and waits another 120 seconds. Meanwhile, ts-1 finishes its upgrade 
and gets a new IP: 192.108.0.5.
   
   5) Another `updateMetadata` attempt is made—possibly to ts-0 or ts-1 (both 
now with outdated IPs in the client’s cache)—and the client waits yet another 
120 seconds. At this point, ts-2 also completes its upgrade and changes its IP 
to 192.108.0.6.
   
   6) **After this**, all TabletServer IPs in the client’s metadata cache are 
stale. No matter which server the client tries to contact, it uses an incorrect 
IP, causing all subsequent requests to time out. As a result, the job cannot 
recover automatically.
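
The timeline above can be replayed with a small model. This is only a sketch, not Fluss code: the `replay` function, the event times (0 s, 120 s, 180 s, 300 s), and the assumption that one successful `updateMetadata` refreshes every cached address are hypothetical. Only the 120-second timeout and the IP addresses come from the report.

```python
NETTY_TIMEOUT_S = 120  # connection timeout quoted in the issue


def replay(cache, events, attempts):
    """Replay updateMetadata attempts against a cluster whose IPs change.

    cache:    server -> IP as the client last saw it
    events:   (time_s, server, ip_or_None) address changes; None means down
    attempts: servers the client picks for updateMetadata, in order
    Returns (elapsed_seconds, sorted list of servers whose cached IP is stale).
    """
    cache = dict(cache)
    live = dict(cache)  # the cluster starts out matching the client cache
    pending = sorted(events, key=lambda e: e[0])
    now = 0

    def apply_events(until):
        # Apply every address change that has happened up to `until`.
        while pending and pending[0][0] <= until:
            _, server, ip = pending.pop(0)
            live[server] = ip

    for target in attempts:
        apply_events(now)
        if live[target] == cache[target]:
            # Reached a live server: assume it returns current addresses.
            cache.update({s: ip for s, ip in live.items() if ip is not None})
            break
        now += NETTY_TIMEOUT_S  # stale or dead address: wait out the timeout

    apply_events(now)
    stale = sorted(s for s in cache if cache[s] != live[s])
    return now, stale


initial = {"ts-0": "192.108.0.1", "ts-1": "192.108.0.2", "ts-2": "192.108.0.3"}
upgrade = [
    (0, "ts-0", None),             # step 2: ts-0 becomes unreachable
    (120, "ts-0", "192.108.0.4"),  # step 3: ts-0 back with a new IP
    (180, "ts-1", "192.108.0.5"),  # step 4: ts-1 finishes its upgrade
    (300, "ts-2", "192.108.0.6"),  # step 5: ts-2 finishes its upgrade
]

elapsed, stale = replay(initial, upgrade, ["ts-0", "ts-0", "ts-1"])
print(elapsed, stale)  # 360 ['ts-0', 'ts-1', 'ts-2']
```

Under this model, three consecutive attempts against stale addresses take 360 seconds, by which time every cached entry is outdated. If the second attempt had gone to ts-2 while its old IP was still valid, the cache would have been refreshed after 120 seconds.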
   
   ### Solution
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


