nguyenmphu opened a new issue, #28328:
URL: https://github.com/apache/airflow/issues/28328
### Apache Airflow version
Other Airflow 2 version (please specify below)
### What happened
Airflow version: `2.3.4`
I have deployed Airflow with the official Helm chart on Kubernetes, using the
`KubernetesExecutor`. Sometimes the scheduler hangs when calling the Kubernetes
API. The log:
```text
ERROR - Exception when executing Executor.end
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 752, in _execute
    self._run_scheduler_loop()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 842, in _run_scheduler_loop
    self.executor.heartbeat()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/base_executor.py", line 171, in heartbeat
    self.sync()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 649, in sync
    next_event = self.event_scheduler.run(blocking=False)
  File "/usr/local/lib/python3.8/sched.py", line 151, in run
    action(*argument, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
    action(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 673, in _check_worker_pods_pending_timeout
    for pod in pending_pods().items:
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 240, in GET
    return self.request("GET", url,
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 213, in request
    r = self.pool_manager.request(method, url,
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 74, in request
    return self.request_encode_url(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 96, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/poolmanager.py", line 376, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 182, in _exit_gracefully
    sys.exit(os.EX_OK)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 773, in _execute
    self.executor.end()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 823, in end
    self._flush_task_queue()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 776, in _flush_task_queue
    self.log.debug('Executor shutting down, task_queue approximate size=%d', self.task_queue.qsize())
  File "<string>", line 2, in qsize
  File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
    kind, result = conn.recv()
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
```
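The log contains two chained tracebacks: `_exit_gracefully` raised `SystemExit` while the scheduler loop was inside `sock.connect()`, and the cleanup call `self.executor.end()` then failed with `ConnectionResetError` while that `SystemExit` was still propagating. The sketch below is a minimal, self-contained reproduction of that *shape* using plain Python semantics; `flush_task_queue` and `execute` are hypothetical stand-ins, not Airflow's code. An exception raised inside a `finally` block gets the propagating exception as its `__context__`, which is what produces the "During handling of the above exception" line.

```python
import logging
import sys

log = logging.getLogger("scheduler-sketch")
captured = []


def flush_task_queue() -> None:
    # Hypothetical stand-in for KubernetesExecutor._flush_task_queue():
    # the process backing the multiprocessing queue is already gone.
    raise ConnectionResetError(104, "Connection reset by peer")


def execute() -> None:
    try:
        # Hypothetical stand-in for the signal handler calling
        # sys.exit(os.EX_OK) while the scheduler loop is blocked.
        sys.exit(0)
    finally:
        try:
            flush_task_queue()  # stand-in for self.executor.end()
        except Exception as exc:
            # The logged traceback includes the chained SystemExit.
            log.exception("Exception when executing Executor.end")
            captured.append(exc)
        # The pending SystemExit resumes propagating after this block.


try:
    execute()
except SystemExit as exc:
    print("process exit code:", exc.code)

print(type(captured[0].__context__).__name__)
```

Because the cleanup error is caught and only logged, the original `SystemExit` still propagates afterwards; what happens to the process next depends on the rest of the shutdown path.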
Then the executor process was killed, but the pod kept running and the
scheduler stopped working.
After restarting, the scheduler worked normally.
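In the first traceback, the scheduler loop was inside `sock.connect()` (reached via `list_namespaced_pod`) when the termination signal arrived. A general mitigation for calls that can block indefinitely is to bound them with timeouts: the Kubernetes Python client's API methods accept a `_request_timeout` argument (either a total timeout or a `(connect, read)` pair). Whether Airflow 2.3.4 passes such a timeout on this code path is not confirmed here. The sketch below only illustrates the principle with the standard library; `fetch_with_timeout` is a hypothetical helper, and the local listener stands in for an API server that accepts connections but never answers.

```python
import socket


def fetch_with_timeout(host: str, port: int, timeout: float) -> bytes:
    """Hypothetical helper: open a TCP connection and read one chunk.

    socket.create_connection() applies `timeout` to connect() and leaves it
    set on the socket, so the recv() below is bounded as well.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return conn.recv(1024)


# A local listener that accepts connections (via the kernel backlog) but never
# replies, standing in for an unresponsive API server.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

try:
    fetch_with_timeout("127.0.0.1", port, timeout=0.5)
    timed_out = False
except socket.timeout:  # an alias of TimeoutError on Python 3.10+
    timed_out = True
finally:
    server.close()

print("timed out:", timed_out)
```

With a timeout in place, a stuck call surfaces as an exception the caller can handle, instead of leaving the loop blocked until a signal interrupts it.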
### What you think should happen instead
When this error occurs, the executor should restart automatically, or the
scheduler should be killed (so that it gets restarted) instead of being left
running in a broken state.
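A minimal sketch of the second option, assuming the scheduler is the container's main process so that its exit makes Kubernetes restart the container under the pod's restart policy. This is one possible approach, not Airflow's actual code: `end_executor_or_die` and its `hard_exit` parameter are hypothetical names. `hard_exit` defaults to `os._exit` and is injectable so the behavior can be exercised without terminating the interpreter.

```python
import logging
import os
from typing import Callable

log = logging.getLogger("scheduler-sketch")


def end_executor_or_die(
    end: Callable[[], None],
    hard_exit: Callable[[int], None] = os._exit,
) -> None:
    """Hypothetical helper: run executor cleanup; on failure, exit non-zero.

    os._exit terminates the process immediately, without running atexit
    handlers or waiting for other threads, so a half-shut-down scheduler
    cannot keep the container alive.
    """
    try:
        end()
    except Exception:
        log.exception("Exception when executing Executor.end; forcing exit")
        hard_exit(1)


# Usage with a fake exit function, so this demo does not kill the interpreter.
exit_codes = []


def broken_end() -> None:
    # Simulates the ConnectionResetError raised by task_queue.qsize() above.
    raise ConnectionResetError(104, "Connection reset by peer")


end_executor_or_die(broken_end, hard_exit=exit_codes.append)
end_executor_or_die(lambda: None, hard_exit=exit_codes.append)  # clean shutdown
print(exit_codes)
```

Exiting with a non-zero status (rather than 0) also makes the failure visible in the container's restart history.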
### How to reproduce
_No response_
### Operating System
Debian GNU/Linux 11 (bullseye)
### Versions of Apache Airflow Providers
_No response_
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)