jmullins edited a comment on issue #12644:
URL: https://github.com/apache/airflow/issues/12644#issuecomment-758154491


   We consistently experienced Kubernetes executor slot starvation, as 
described above, where worker pods get stuck in a completed state and are never 
deleted because the KubernetesJobWatcher watch blocks indefinitely:
   
   
https://github.com/apache/airflow/blob/1.10.14/airflow/executors/kubernetes_executor.py#L315-L322
   
   
   The indefinite blocking is due to the lack of TCP keepalives and the absence 
of a default _request_timeout (socket timeout) in kube_client_request_args:
   
https://github.com/apache/airflow/blob/2.0.0/airflow/config_templates/default_airflow.cfg#L990
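   
   For illustration, here is a minimal sketch (not the executor code itself) of 
the failure mode and the fix: a pod watch built with the kubernetes Python 
client, where `_request_timeout` bounds how long a read on a dead connection 
can block. The namespace and client setup are assumptions.
   
   ```python
   # Minimal sketch, not Airflow's executor code: a pod watch whose reads are
   # bounded by a socket timeout. Namespace/client setup are illustrative only.
   from kubernetes import client, config, watch

   config.load_kube_config()  # or config.load_incluster_config() inside a pod
   v1 = client.CoreV1Api()

   w = watch.Watch()
   # Without _request_timeout, a half-open connection (e.g. after a conntrack
   # flush on an overlay network) can leave this stream blocked forever.
   # With it, the underlying read times out, the exception propagates, and the
   # watcher can be restarted instead of silently hanging.
   for event in w.stream(v1.list_namespaced_pod,
                         namespace="airflow",    # assumed namespace
                         _request_timeout=600):  # socket timeout, in seconds
       pod = event["object"]
       print(event["type"], pod.metadata.name, pod.status.phase)
   ```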
   
   We were able to reproduce this behavior consistently by injecting network 
faults or by clearing the conntrack state on the node where the scheduler was 
running as part of an overlay network.
   
   Setting a socket timeout (_request_timeout in kube_client_request_args) 
prevents executor slot starvation: the KubernetesJobWatcher recovers once the 
timeout is reached and properly cleans up worker pods stuck in the completed 
state.
   
   ```
   kube_client_request_args = { "_request_timeout": 600 }
   ```
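   
   For context, the setting lives in the [kubernetes] section of airflow.cfg 
(under the 1.10.x/2.0.x section naming linked above); a sketch of how we set 
it, with the comment wording our own:
   
   ```
   [kubernetes]
   # Socket-level timeout (seconds) passed through to the kubernetes client
   # calls; bounds how long the KubernetesJobWatcher's watch can block on a
   # dead connection before it errors out and the watcher is restarted.
   kube_client_request_args = { "_request_timeout": 600 }
   ```
   
   The same value can also be supplied through Airflow's standard environment 
variable override, AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS.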
   
   We currently set _request_timeout to 10 minutes, so we won't see a timeout 
unless there's a network fault, since the Kubernetes watch itself expires 
before then (after 5 minutes).
   
   I think it makes sense to consider setting a default _request_timeout, even 
if the value is high, to protect against executor slot starvation and 
unavailability during network faults.

