[GitHub] [airflow] hterik opened a new issue #21465: Kubernetes scheduler crashes on transient Kubernetes API 500 errors.

GitBox Wed, 09 Feb 2022 06:26:55 -0800


hterik opened a new issue #21465:
URL: https://github.com/apache/airflow/issues/21465



   ### Apache Airflow version
   
   2.2.2
   
   ### What happened
   
   The scheduler was running fine when the heartbeat suddenly stopped and no 
more jobs were scheduled.
   The log showed the following:
   ```
   scheduler_job.py: ERROR - Exception when executing SchedulerJob._run_scheduler_loop
   Traceback (most recent call last):
     File "airflow/jobs/scheduler_job.py", line 628, in _execute
       self._run_scheduler_loop()
     File "airflow/jobs/scheduler_job.py", line 711, in _run_scheduler_loop
       self.executor.heartbeat()
     File "airflow/executors/base_executor.py", line 162, in heartbeat
       self.sync()
     File "airflow/executors/kubernetes_executor.py", line 621, in sync
       next_event = self.event_scheduler.run(blocking=False)
     File "/usr/local/lib/python3.9/sched.py", line 151, in run
       action(*argument, **kwargs)
     File "airflow/utils/event_scheduler.py", line 36, in repeat
       action(*args, **kwargs)
     File "airflow/executors/kubernetes_executor.py", line 643, in _check_worker_pods_pending_timeout
       for pod in pending_pods().items:
     File "kubernetes/client/api/core_v1_api.py", line 12803, in list_namespaced_pod
       (data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
     File "kubernetes/client/api/core_v1_api.py", line 12891, in list_namespaced_pod_with_http_info
       return self.api_client.call_api(
     File "kubernetes/client/api_client.py", line 340, in call_api
       return self.__call_api(resource_path, method,
     File "kubernetes/client/api_client.py", line 172, in __call_api
       response_data = self.request(
     File "kubernetes/client/api_client.py", line 362, in request
       return self.rest_client.GET(url,
     File "kubernetes/client/rest.py", line 237, in GET
       return self.request("GET", url,
     File "kubernetes/client/rest.py", line 231, in request
       raise ApiException(http_resp=r)
   kubernetes.client.rest.ApiException: (500)
   Reason: Internal Server Error
   HTTP response headers: HTTPHeaderDict({'Audit-Id': 'xxx', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'yyy', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'zzz', 'Date': 'Wed, 09 Feb 2022 11:16:07 GMT', 'Content-Length': '119'})
   HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: leader changed","code":500}
   ```
   
   It looks like the Kubernetes API returned a transient error, and the same 
request would have succeeded on the next retry.
   
   ### What you expected to happen
   
   * The scheduler logs the error but continues ticking, implicitly retrying 
`_check_worker_pods_pending_timeout` on the next `_run_scheduler_loop`.
   * Possibly with a gradually increasing retry backoff time?
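
The expected behaviour can be sketched roughly as follows. This is only an illustration, not Airflow's code: `call_with_backoff` is a hypothetical helper, and the `ApiException` class here is a local stand-in for `kubernetes.client.rest.ApiException` (which exposes a `status` attribute) so that the snippet runs without a cluster. It treats any 5xx as transient, which is an assumption.

```python
import logging
import time

log = logging.getLogger(__name__)


class ApiException(Exception):
    """Local stand-in for kubernetes.client.rest.ApiException (assumption)."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


def call_with_backoff(func, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on a 5xx ApiException, log it and retry with exponential backoff.

    Returns func()'s result, or None if every attempt failed with a 5xx,
    so the caller's loop keeps ticking and can try again on its next pass.
    """
    for attempt in range(retries):
        try:
            return func()
        except ApiException as exc:
            if exc.status < 500:
                raise  # 4xx errors are not treated as transient here
            if attempt == retries - 1:
                break  # no point sleeping after the final attempt
            delay = base_delay * (2 ** attempt)
            log.warning("API error %s, retrying in %.1fs", exc.status, delay)
            sleep(delay)
    log.error("Giving up after %d attempts; will retry on the next loop", retries)
    return None
```

In a real scheduler loop, blocking sleeps may be undesirable; simply logging the error and letting the next `_run_scheduler_loop` iteration repeat the check would give the same implicit retry without the helper.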
   
   
   ### How to reproduce
   
   Not sure. The log says `etcdserver: leader changed","code":500`, which would 
be hard to reproduce. The failure mode can probably be injected by simply 
disconnecting the network path to the Kubernetes cluster entirely.
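
As an alternative to cutting the network, the failure could probably also be injected in a test by mocking the pod-listing call. The sketch below is hypothetical throughout: `run_loop` is a simplified loop with no error handling, and `ApiException` is a local stand-in for `kubernetes.client.rest.ApiException`; neither is Airflow code.

```python
from unittest import mock


class ApiException(Exception):
    """Local stand-in for kubernetes.client.rest.ApiException (assumption)."""

    def __init__(self, status, reason):
        super().__init__(f"({status}) Reason: {reason}")
        self.status = status
        self.reason = reason


def run_loop(list_pending_pods, ticks):
    """Hypothetical scheduler-like loop without error handling."""
    completed = 0
    for _ in range(ticks):
        list_pending_pods()  # an exception here aborts the whole loop
        completed += 1
    return completed


# First call succeeds, the second raises the injected 500,
# and the remaining calls would have succeeded if the loop had survived.
fake_list = mock.Mock(
    side_effect=[[], ApiException(500, "Internal Server Error"), [], []]
)

try:
    run_loop(fake_list, ticks=4)
    crashed = False
except ApiException:
    crashed = True
```

Here the loop stops at the second tick even though later calls would have succeeded, which mirrors the behaviour described in this report.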
   
   ### Operating System
   
   Debian GNU/Linux 10 (buster)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes==2.1.0
   apache-airflow-providers-docker==2.3.0
   apache-airflow-providers-http==2.0.1
   apache-airflow-providers-postgres==2.3.0
   apache-airflow-providers-ssh==2.3.0
   
   
   ### Deployment
   
   Docker-Compose
   
   ### Deployment details
   
   Azure managed Kubernetes Service (AKS)
   
   ### Anything else
   
   Only seen once
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

