michalmisiewicz opened a new pull request #11406: URL: https://github.com/apache/airflow/pull/11406
When running Airflow on cloud Kubernetes services such as Azure Kubernetes Service, KubernetesExecutor can hang indefinitely on pod submission because of an idle connection timeout. For example, in Azure Firewall the idle timeout is set to [4 minutes](https://docs.microsoft.com/en-us/azure/firewall/firewall-faq#what-is-the-tcp-idle-timeout-for-azure-firewall). A similar problem can be observed on AWS.

Under the hood, the [Kubernetes client](https://github.com/kubernetes-client/python/tree/master/kubernetes) uses urllib3 PoolManager, which reuses connections. On idle timeout, an RST packet is sent back to the client, which causes urllib3 to hang indefinitely.

This PR introduces a fix based on the TCP keepalive mechanism. Unfortunately, the Kubernetes client does not support providing socket options when instantiating a `CoreV1Api` instance.

It was a nightmare to find out why requests were hanging. After all that, I have run the fix in production for one month without errors. Now I can sleep peacefully... 🛌

Closes #10636

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
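
For readers unfamiliar with the TCP keepalive mechanism the pull request description refers to, here is a minimal sketch using only the Python standard library. It is not the code from the PR: the function names `keepalive_socket_options` and `apply_options` are hypothetical, and the timing values (120 s idle, 30 s interval, 6 probes) are illustrative assumptions chosen only to stay well under a 4-minute idle timeout.

```python
import socket


def keepalive_socket_options(idle=120, interval=30, count=6):
    """Build (level, option, value) tuples that enable TCP keepalive.

    All three timing values are illustrative assumptions, not values
    taken from the pull request.
    """
    # SO_KEEPALIVE switches keepalive probing on for the socket.
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]

    # The tuning constants are platform dependent, so add each one
    # only if this Python build exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle))
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between consecutive probes.
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval))
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is dropped.
        options.append((socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count))
    return options


def apply_options(sock, options):
    """Apply each (level, option, value) tuple to an existing socket."""
    for level, option, value in options:
        sock.setsockopt(level, option, value)
```

Because the Kubernetes client does not accept socket options directly, one possible way to use such a list is to extend urllib3's class-level `HTTPConnection.default_socket_options` (and the same attribute on `HTTPSConnection`) before the client is created, so that every pooled connection picks the options up. Whether that matches the exact change in this PR should be checked against the diff.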
