jonminter-dojo opened a new issue, #68396: URL: https://github.com/apache/airflow/issues/68396
### Under which category would you file this issue? Providers ### Apache Airflow version 3.2.1 ### What happened and how to reproduce it? **Issue Description** Noticed while debugging an issue with out schedulers crash looping because of K8s liveness probes failing that connections opened to the Kubernetes API server were getting stuck in a `CLOSE_WAIT` state. Meaning next time the connection was pulled from the connection pool it would be unable to make API calls successfully. These connections were never removed from the connection pool and remained in the pool in that state. This ended up being a red herring for what was causing the crash-looping though. There was high DB load on our metadata DB and this presumably caused scheduler loop to slow down which in turn caused delay between k8s API calls giving an opportunity for the connections in the k8s API connection pool to be dropped by the NLB in front of the k8s API server. Once we fixed the issue with the metadata DB the issue with the k8s api connection pool and `CLOSE_WAIT` state connections resolved itself. But this issue with keep alives could affect anyone that has a load balancer in front of the K8s API server as is common in cloud environments like AWS/GCP/Azure where a connection timeout is configured. Digging into the code for the k8s provider it appears the mechanism used to enable TCP keepalives no longer works with urllib3 v2.x. It relies on new HTTPConnection/HTTPSConnection using HTTPConnection.default_socket_options as a default when no socket_option argument is provided. But in urllib3 v2.x the HTTPSConnection/HTTPConnection __init__ function sets socket_options as a default argument value when not provided meaning it gets evaluated at module import time. Since `_enable_tcp_keepalive` overwrites `HTTPConnection.default_socket_options` the changes are never used when creating new `HTTPConnection` objects. I also wonder if instead of trying to patch `default_socket_options` would it make more sense to use the `Configuration` object from the kubernetes client to add socket options: https://github.com/kubernetes-client/python/blob/f8c3ba2a7be3c90f2e94ea1f655f8a589c642333/kubernetes/aio/client/configuration.py#L230 This can be passed to `ApiClient`: https://github.com/kubernetes-client/python/blob/f8c3ba2a7be3c90f2e94ea1f655f8a589c642333/kubernetes/client/api_client.py#L68 Which is passed to `RESTClientObject`: https://github.com/kubernetes-client/python/blob/f8c3ba2a7be3c90f2e94ea1f655f8a589c642333/kubernetes/client/rest.py#L75 And the socket options are provided to urllib3 `PoolManager` if set: https://github.com/kubernetes-client/python/blob/f8c3ba2a7be3c90f2e94ea1f655f8a589c642333/kubernetes/client/rest.py#L105 And that is passed to the underlying `HTTPConnectionPool`: https://github.com/urllib3/urllib3/blob/e8a81de5e1fe6b52dbcbdd93345560097c1e58fb/src/urllib3/poolmanager.py#L261 And should be passing into the connection itself: https://github.com/urllib3/urllib3/blob/e8a81de5e1fe6b52dbcbdd93345560097c1e58fb/src/urllib3/connectionpool.py#L252 And `HTTPConnection.__init__` has a `socket_options` argument: https://github.com/urllib3/urllib3/blob/e8a81de5e1fe6b52dbcbdd93345560097c1e58fb/src/urllib3/connection.py#L140 I haven't actually verified this works but it seems like you could thread the socket options through the client config vs trying to patch default options. **Steps to Reproduce** See https://github.com/jonminter-dojo/airflow-k8s-tcp-keepalive-repro/blob/main/main.py which has a minimal reproduction using `_enable_tcp_keepalive` and `get_kube_client` from `airflow.providers.cncf.kubernetes.kube_client` to demonstrate the issue. ### What you think should happen instead? When `enable_tcp_keepalive` is enabled the TCP keep alive socket options should be set on new connections opened for the kubernetes api client. ### Operating System Debian GNU/Linux 12 (bookworm) ### Deployment Official Apache Airflow Helm Chart ### Apache Airflow Provider(s) cncf-kubernetes ### Versions of Apache Airflow Providers 10.16.0 ### Official Helm Chart version Not Applicable ### Kubernetes Version Not Applicable ### Helm Chart configuration Not Applicable ### Docker Image customizations _No response_ ### Anything else? _No response_ ### Are you willing to submit PR? - [x] Yes I am willing to submit a PR! ### Code of Conduct - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
