michalmisiewicz opened a new pull request #11406:
URL: https://github.com/apache/airflow/pull/11406


   When running Airflow on Cloud Kubernetes  Services like Azure Kubernetes 
Service, KubernetesExecutor can hangs indefinitely on pod submission due to 
idle connection time-out. For example in Azure Firewall, idle timeout is set to 
[4 
minutes](https://docs.microsoft.com/en-us/azure/firewall/firewall-faq#what-is-the-tcp-idle-timeout-for-azure-firewall).
 Simmilar problem can be observe on AWS.
   
   [Kubernetes 
client](https://github.com/kubernetes-client/python/tree/master/kubernetes) 
under the hood is using urllib3 PoolManager which introduce connections reuse. 
On idle timeout, RST package is send back to client which cause urllib3 to hang 
indefinitely. 
   
   This PR introduced fix based on TCP keepalive mechanism. Unfortunately 
Kubernetes client does not support providing socket options when instantiating 
`CoreV1Api` instance.
   
   It was nightmare to find out why request were hanging. After all I have run 
fix for one month on production without errors. Now I can sleep peacefully... 🛌 
   
   Closes #10636 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to