Roland de Boo created AIRFLOW-5282:
--------------------------------------

             Summary: KubernetesExecutor hangs when kubernetes API times out
                 Key: AIRFLOW-5282
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5282
             Project: Apache Airflow
          Issue Type: Bug
          Components: executor-kubernetes
    Affects Versions: 1.10.4
            Reporter: Roland de Boo
            Assignee: Roland de Boo
While migrating to a different Kubernetes cluster, we observed that the scheduler hangs very frequently. No output is generated in the logs, and the UI states: {{The scheduler does not appear to be running. Last heartbeat was received 9 minutes ago.}}

I attached py-spy to the scheduler process to investigate. This is the output:

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   38.00s    38.00s   recv_into (OpenSSL/SSL.py:1821)
  0.00% 100.00%   0.000s    38.00s   sync (airflow/contrib/executors/kubernetes_executor.py:795)
  0.00% 100.00%   0.000s    38.00s   request_encode_body (urllib3/request.py:150)
  0.00% 100.00%   0.000s    38.00s   begin (http/client.py:307)
  0.00% 100.00%   0.000s    38.00s   recv_into (urllib3/contrib/pyopenssl.py:304)
  0.00% 100.00%   0.000s    38.00s   wrapper (airflow/utils/cli.py:74)
  0.00% 100.00%   0.000s    38.00s   create_namespaced_pod_with_http_info (kubernetes/client/apis/core_v1_api.py:6206)
  0.00% 100.00%   0.000s    38.00s   _make_request (urllib3/connectionpool.py:383)
  0.00% 100.00%   0.000s    38.00s   _execute_helper (airflow/jobs/scheduler_job.py:1412)

The scheduler is blocked in a blocking SSL read underneath a pod-creation call to the Kubernetes API. Next, I added debug logging to `cli.py` and urllib3.
This confirmed that the scheduler hangs at the point where no response comes from the kubernetes API:

(base) macbook:~ roland$ kubectl logs airflow-scheduler-7c5d6df8f4-w4tpn > logs.txt
(base) macbook:~ roland$ cat logs.txt | grep pod_launcher.py
[2019-08-20 14:40:10,471] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:40:10,652] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:43:03,213] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:43:03,296] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:44:38,078] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:44:38,409] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:44:39,804] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:44:39,867] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:45:21,785] {pod_launcher.py:60} DEBUG - Pod Creation Request:
[2019-08-20 14:45:22,040] {pod_launcher.py:63} DEBUG - Pod Creation Response: {'api_version': 'v1',
[2019-08-20 14:50:22,224] {pod_launcher.py:60} DEBUG - Pod Creation Request:

Note that the last Pod Creation Request (14:50:22) never receives a response.

It turns out that a request timeout can already be specified in `airflow.cfg`, but the default is infinite. In the kubernetes section:

`kube_client_request_args = {"_request_timeout" : [60,60] }`

With this timeout in place, the scheduler no longer hangs, but instead exits with an error (and is restarted by Kubernetes).

I propose a PR which:
 * Adds a default _request_timeout in airflow.cfg, so that default installations will not hang indefinitely
 * Catches the timeout errors in the scheduler and retries scheduling the task, in a similar fashion to how ApiExceptions are caught.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
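The retry behavior proposed above could look roughly like the sketch below. The helper names (`launch_pod_with_retry`, `create_pod`) are hypothetical, not Airflow's actual API; it assumes the kubernetes client surfaces urllib3's `ReadTimeoutError` when `_request_timeout` expires:

```python
import time
from urllib3.exceptions import ReadTimeoutError


def launch_pod_with_retry(create_pod, retries=3, backoff=5):
    """Call create_pod(), retrying on API timeouts instead of crashing.

    Hypothetical sketch: mirrors how ApiExceptions are already caught
    in the scheduler, but for request timeouts against the k8s API.
    """
    for attempt in range(1, retries + 1):
        try:
            return create_pod()
        except ReadTimeoutError:
            if attempt == retries:
                raise  # give up; let the scheduler surface the error
            time.sleep(backoff)  # wait before re-submitting the request
```

The `[60,60]` pair in `kube_client_request_args` is the (connect timeout, read timeout) in seconds, so a hung read would be abandoned after 60 seconds and, under this sketch, retried rather than killing the scheduler.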