hterik opened a new issue #21465:
URL: https://github.com/apache/airflow/issues/21465
### Apache Airflow version
2.2.2
### What happened
The scheduler was running fine when the heartbeat suddenly stopped and no more
jobs were scheduled.
The log showed the following:
```
scheduler_job.py: ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
  File "airflow/jobs/scheduler_job.py", line 628, in _execute
    self._run_scheduler_loop()
  File "airflow/jobs/scheduler_job.py", line 711, in _run_scheduler_loop
    self.executor.heartbeat()
  File "airflow/executors/base_executor.py", line 162, in heartbeat
    self.sync()
  File "airflow/executors/kubernetes_executor.py", line 621, in sync
    next_event = self.event_scheduler.run(blocking=False)
  File "/usr/local/lib/python3.9/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "airflow/utils/event_scheduler.py", line 36, in repeat
    action(*args, **kwargs)
  File "airflow/executors/kubernetes_executor.py", line 643, in _check_worker_pods_pending_timeout
    for pod in pending_pods().items:
  File "kubernetes/client/api/core_v1_api.py", line 12803, in list_namespaced_pod
    (data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "kubernetes/client/api/core_v1_api.py", line 12891, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "kubernetes/client/api_client.py", line 340, in call_api
    return self.__call_api(resource_path, method,
  File "kubernetes/client/api_client.py", line 172, in __call_api
    response_data = self.request(
  File "kubernetes/client/api_client.py", line 362, in request
    return self.rest_client.GET(url,
  File "kubernetes/client/rest.py", line 237, in GET
    return self.request("GET", url,
  File "kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'xxx', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'yyy', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'zzz', 'Date': 'Wed, 09 Feb 2022 11:16:07 GMT', 'Content-Length': '119'})
HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: leader changed","code":500}
```
It looks like the Kubernetes API returned a transient error, and the same call
would probably have succeeded on the next retry.
### What you expected to happen
* The scheduler logs the error but keeps ticking, implicitly retrying
`_check_worker_pods_pending_timeout` on the next `_run_scheduler_loop`.
* Possibly with a gradually increasing retry backoff?
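To illustrate the behaviour described in the two points above, here is a minimal sketch of a guard around a periodic check. It is not Airflow code: `BackoffGuard`, `TransientApiError`, and the `base_delay`/`max_delay` parameters are hypothetical names made up for this example, and `TransientApiError` only stands in for `kubernetes.client.rest.ApiException`. The assumption is that HTTP 5xx responses are worth retrying while 4xx responses are not.

```python
import logging
import time

log = logging.getLogger(__name__)


class TransientApiError(Exception):
    """Hypothetical stand-in for kubernetes.client.rest.ApiException."""

    def __init__(self, status):
        super().__init__(f"({status})")
        self.status = status


class BackoffGuard:
    """Runs a periodic check, logging server-side errors and backing off."""

    def __init__(self, check, base_delay=1.0, max_delay=60.0, clock=time.monotonic):
        self._check = check
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._clock = clock
        self._failures = 0
        self._not_before = 0.0

    def tick(self):
        """Call once per loop iteration; returns True only if the check ran and succeeded."""
        now = self._clock()
        if now < self._not_before:
            return False  # still backing off, skip the check on this tick
        try:
            self._check()
        except TransientApiError as err:
            if err.status < 500:
                raise  # assumption: client errors will not heal on retry
            self._failures += 1
            # Double the delay after each consecutive failure, up to max_delay.
            delay = min(self._base_delay * 2 ** (self._failures - 1), self._max_delay)
            self._not_before = now + delay
            log.warning("Check failed with HTTP %s; retrying in %.1fs", err.status, delay)
            return False
        # Success resets the backoff state.
        self._failures = 0
        self._not_before = 0.0
        return True
```

The loop itself never sees the 5xx exception, so the heartbeat would keep going, and the check is simply skipped until the backoff window has passed.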
### How to reproduce
Not sure. The log says `etcdserver: leader changed","code":500`, which would be
hard to reproduce. The failure mode can probably be injected by simply cutting
the network path to the Kubernetes cluster entirely.
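Short of breaking the network, the failure could probably also be simulated in a test by making the pod-listing call raise once. The sketch below shows the idea with `unittest.mock` only. `FakeApiException` and `run_unguarded_loop` are hypothetical stand-ins invented for this example, not Airflow or Kubernetes client code. With the real client, one would presumably patch `list_namespaced_pod` in a similar way, but I have not tried that.

```python
from unittest import mock


class FakeApiException(Exception):
    """Hypothetical stand-in for kubernetes.client.rest.ApiException."""

    def __init__(self, status, reason):
        super().__init__(f"({status}) Reason: {reason}")
        self.status = status
        self.reason = reason


def run_unguarded_loop(list_pods, iterations):
    """Mimics a loop that calls the API on every tick with no error handling."""
    completed = 0
    try:
        for _ in range(iterations):
            list_pods()
            completed += 1
    except FakeApiException as err:
        # The loop is already dead here; we only capture the error for inspection.
        return completed, err
    return completed, None


# First call succeeds, second fails with a 500, third would have succeeded.
list_pods = mock.Mock(
    side_effect=[["pod-a"], FakeApiException(500, "Internal Server Error"), ["pod-a"]]
)
completed, error = run_unguarded_loop(list_pods, iterations=3)
```

Here the loop stops at the injected error and never reaches the third call, which is the same shape as the failure in the log: one bad response ends the loop even though the next call would have worked.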
### Operating System
Debian GNU/Linux 10 (buster)
### Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes==2.1.0
apache-airflow-providers-docker==2.3.0
apache-airflow-providers-http==2.0.1
apache-airflow-providers-postgres==2.3.0
apache-airflow-providers-ssh==2.3.0
### Deployment
Docker-Compose
### Deployment details
Azure managed Kubernetes Service (AKS)
### Anything else
Only seen once
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)