RaulGracia opened a new pull request #2761:
URL: https://github.com/apache/bookkeeper/pull/2761


   ### Motivation
   
   Add `TCP_USER_TIMEOUT` to the Epoll channel configuration to limit how long 
a connection keeps sending keepalives to a non-responding Bookie.
   
   ### Changes
   
   The original issue reported that in environments where Bookies may go down 
unexpectedly and come back with a different IP (e.g., Kubernetes), the 
BookKeeper client may spend some time attempting to connect to the old IP of 
the restarted Bookie (see #2482 for details). To prevent this problem (in 
Epoll channels), we introduce the following changes:
   - Epoll channels are now configured with `TCP_USER_TIMEOUT`. This parameter 
takes precedence over the underlying TCP keepalive configuration (see 
https://datatracker.ietf.org/doc/html/rfc5482), which by default may keep 
retrying for too long depending on the environment (e.g., 10-15 minutes in our 
experience).
   - To avoid adding more configuration parameters, the existing 
`clientConnectTimeoutMillis` value in `ClientConfiguration` is reused to set 
`TCP_USER_TIMEOUT`, because the two settings serve a similar purpose.
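   
   For illustration, the two changes above can be sketched roughly as follows. 
This is a minimal sketch, not the actual diff of this PR: it assumes Netty 4.1 
(with the epoll transport classes) on the classpath, and the class name 
`TcpUserTimeoutSketch`, the method `applyTcpUserTimeout`, and the `usingEpoll` 
flag are hypothetical. In the real client the timeout value would come from the 
`clientConnectTimeoutMillis` setting in `ClientConfiguration`.

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.epoll.EpollChannelOption;

// Hypothetical helper, for illustration only (not code from this PR).
final class TcpUserTimeoutSketch {
    private TcpUserTimeoutSketch() {}

    /**
     * Sets TCP_USER_TIMEOUT on the bootstrap, but only for epoll-based
     * channels, since the option is specific to Netty's epoll transport.
     *
     * @param bootstrap            client bootstrap to configure
     * @param usingEpoll           assumed flag: true when the event loop group is epoll-based
     * @param connectTimeoutMillis existing connect timeout, reused as the
     *                             TCP user timeout (milliseconds)
     */
    static Bootstrap applyTcpUserTimeout(Bootstrap bootstrap,
                                         boolean usingEpoll,
                                         int connectTimeoutMillis) {
        if (usingEpoll) {
            // Reuse the connect timeout instead of adding a new parameter.
            bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, connectTimeoutMillis);
        }
        return bootstrap;
    }
}
```

   A caller would pass the client bootstrap before connecting, for example 
`TcpUserTimeoutSketch.applyTcpUserTimeout(bootstrap, true, 10000)`, where 
`10000` is only a placeholder for the configured connect timeout.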
   
   ### Validation
   
   We have reproduced the original testing environment in which this problem 
appears consistently:
   - Cluster with 4 Bookies and 3 Kubernetes nodes, in addition to Pravega 
(https://pravega.io), which uses the BookKeeper client.
   - Deployed an application that performs I/O against Pravega (and therefore 
against BookKeeper).
   - Periodically shut down a Kubernetes node, so Bookkeeper pods on it are 
restarted as well.
   
   With this test procedure, without the proposed PR we consistently observe 
BookKeeper clients getting stuck trying to contact the old IPs of restarted 
Bookies. With this change, we confirmed via logs that the new configuration 
takes effect, and we have not been able to reproduce the original problem so 
far after performing multiple node reboots.
   
   Master Issue: #2482
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


