jonminter-dojo opened a new issue, #68396:
URL: https://github.com/apache/airflow/issues/68396

   ### Under which category would you file this issue?
   
   Providers
   
   ### Apache Airflow version
   
   3.2.1
   
   ### What happened and how to reproduce it?
   
   **Issue Description**
   Noticed while debugging an issue with out schedulers crash looping because 
of K8s liveness probes failing that connections opened to the Kubernetes API 
server were getting stuck in a `CLOSE_WAIT` state. Meaning next time the 
connection was pulled from the connection pool it would be unable to make API 
calls successfully. These connections were never removed from the connection 
pool and remained in the pool in that state.
   
   This ended up being a red herring for what was causing the crash-looping 
though.
   
   There was high DB load on our metadata DB and this presumably caused 
scheduler loop to slow down which in turn caused delay between k8s API calls 
giving an opportunity for the connections in the k8s API connection pool to be 
dropped by the NLB in front of the k8s API server. Once we fixed the issue with 
the metadata DB the issue with the k8s api connection pool and `CLOSE_WAIT` 
state connections resolved itself.
   
   But this issue with keep alives could affect anyone that has a load balancer 
in front of the K8s API server as is common in cloud environments like 
AWS/GCP/Azure where a connection timeout is configured.
   
   Digging into the code for the k8s provider it appears the mechanism used to 
enable TCP keepalives no longer works with urllib3 v2.x.
   
   It relies on new HTTPConnection/HTTPSConnection using 
HTTPConnection.default_socket_options as a default when no socket_option 
argument is provided. But in urllib3 v2.x the HTTPSConnection/HTTPConnection 
__init__ function sets socket_options as a default argument value when not 
provided meaning it gets evaluated at module import time. Since 
`_enable_tcp_keepalive` overwrites `HTTPConnection.default_socket_options` the 
changes are never used when creating new `HTTPConnection` objects.
   
   I also wonder if instead of trying to patch `default_socket_options` would 
it make more sense to use the `Configuration` object from the kubernetes client 
to add socket options:
   
   
https://github.com/kubernetes-client/python/blob/f8c3ba2a7be3c90f2e94ea1f655f8a589c642333/kubernetes/aio/client/configuration.py#L230
   
   This can be passed to `ApiClient`:
   
   
https://github.com/kubernetes-client/python/blob/f8c3ba2a7be3c90f2e94ea1f655f8a589c642333/kubernetes/client/api_client.py#L68
   
   Which is passed to `RESTClientObject`:
   
   
https://github.com/kubernetes-client/python/blob/f8c3ba2a7be3c90f2e94ea1f655f8a589c642333/kubernetes/client/rest.py#L75
   
   And the socket options are provided to urllib3 `PoolManager` if set:
   
   
https://github.com/kubernetes-client/python/blob/f8c3ba2a7be3c90f2e94ea1f655f8a589c642333/kubernetes/client/rest.py#L105
   
   And that is passed to the underlying `HTTPConnectionPool`:
   
   
https://github.com/urllib3/urllib3/blob/e8a81de5e1fe6b52dbcbdd93345560097c1e58fb/src/urllib3/poolmanager.py#L261
   
   And should be passing into the connection itself:
   
   
https://github.com/urllib3/urllib3/blob/e8a81de5e1fe6b52dbcbdd93345560097c1e58fb/src/urllib3/connectionpool.py#L252
   
   And `HTTPConnection.__init__` has a `socket_options` argument:
   
   
https://github.com/urllib3/urllib3/blob/e8a81de5e1fe6b52dbcbdd93345560097c1e58fb/src/urllib3/connection.py#L140
   
   I haven't actually verified this works but it seems like you could thread 
the socket options through the client config vs trying to patch default options.
   
   **Steps to Reproduce**
   
   See 
https://github.com/jonminter-dojo/airflow-k8s-tcp-keepalive-repro/blob/main/main.py
 which has a minimal reproduction using `_enable_tcp_keepalive` and 
`get_kube_client` from `airflow.providers.cncf.kubernetes.kube_client` to 
demonstrate the issue.
   
   ### What you think should happen instead?
   
   When `enable_tcp_keepalive` is enabled the TCP keep alive socket options 
should be set on new connections opened for the kubernetes api client.
   
   ### Operating System
   
   Debian GNU/Linux 12 (bookworm)
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Apache Airflow Provider(s)
   
   cncf-kubernetes
   
   ### Versions of Apache Airflow Providers
   
   10.16.0
   
   ### Official Helm Chart version
   
   Not Applicable
   
   ### Kubernetes Version
   
   Not Applicable
   
   ### Helm Chart configuration
   
   Not Applicable
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to