nguyenmphu opened a new issue, #28328:
URL: https://github.com/apache/airflow/issues/28328

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   Airflow version: `2.3.4`
   
   I deployed Airflow with the official Helm chart on K8s, using 
`KubernetesExecutor`. Sometimes the scheduler hangs when calling the K8s API. 
The log:
   ``` bash
   ERROR - Exception when executing Executor.end
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 752, in _execute
       self._run_scheduler_loop()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 842, in _run_scheduler_loop
       self.executor.heartbeat()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/base_executor.py", line 171, in heartbeat
       self.sync()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 649, in sync
       next_event = self.event_scheduler.run(blocking=False)
     File "/usr/local/lib/python3.8/sched.py", line 151, in run
       action(*argument, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
       action(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 673, in _check_worker_pods_pending_timeout
       for pod in pending_pods().items:
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
       return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
       return self.api_client.call_api(
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
       return self.__call_api(resource_path, method,
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
       response_data = self.request(
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
       return self.rest_client.GET(url,
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 240, in GET
       return self.request("GET", url,
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 213, in request
       r = self.pool_manager.request(method, url,
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 74, in request
       return self.request_encode_url(
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 96, in request_encode_url
       return self.urlopen(method, url, **extra_kw)
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/poolmanager.py", line 376, in urlopen
       response = conn.urlopen(method, u.request_uri, **kw)
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 815, in urlopen
       return self.urlopen(
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
       httplib_response = self._make_request(
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
       self._validate_conn(conn)
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
       conn.connect()
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 358, in connect
       self.sock = conn = self._new_conn()
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
       conn = connection.create_connection(
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
       sock.connect(sa)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 182, in _exit_gracefully
       sys.exit(os.EX_OK)
   SystemExit: 0
   During handling of the above exception, another exception occurred:
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 773, in _execute
       self.executor.end()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 823, in end
       self._flush_task_queue()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 776, in _flush_task_queue
       self.log.debug('Executor shutting down, task_queue approximate size=%d', self.task_queue.qsize())
     File "<string>", line 2, in qsize
     File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
       kind, result = conn.recv()
     File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 250, in recv
       buf = self._recv_bytes()
     File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
       buf = self._recv(4)
     File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
       chunk = read(handle, remaining)
   ConnectionResetError: [Errno 104] Connection reset by peer
   ```
   After that, the executor process was killed, but the pod kept running while 
the scheduler no longer did any work.
   
   After a restart, the scheduler worked normally again.
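
One possible reading of the first traceback: its last frame is `_exit_gracefully`, stacked directly on top of `sock.connect(sa)`, which is what a Python signal handler looks like when it runs while the main thread is blocked in a call. The log does not show where the signal came from. The sketch below reproduces only that stack shape with the standard library; the handler is a hypothetical stand-in (not Airflow's code), and `time.sleep` stands in for the blocked connect.

```python
import os
import signal
import sys
import threading
import time


def _exit_gracefully(signum, frame):
    # Hypothetical stand-in for the handler named in the traceback.
    sys.exit(os.EX_OK)


def blocking_call_interrupted_by_signal():
    """Return the exit code raised inside a blocking call by the handler."""
    # Signal handlers can only be installed from the main thread.
    previous = signal.signal(signal.SIGTERM, _exit_gracefully)
    main_ident = threading.main_thread().ident
    # Deliver SIGTERM to the main thread shortly after it starts blocking.
    timer = threading.Timer(
        0.2, signal.pthread_kill, args=(main_ident, signal.SIGTERM)
    )
    timer.start()
    try:
        time.sleep(5)  # stands in for the blocked sock.connect(sa)
        return None
    except SystemExit as exc:
        # The handler's SystemExit surfaces inside the blocking call.
        return exc.code
    finally:
        timer.cancel()
        signal.signal(signal.SIGTERM, previous)


if __name__ == "__main__":
    print(blocking_call_interrupted_by_signal())
```

Under this reading, the `SystemExit: 0` is the scheduler reacting to a signal while it was stuck connecting to the K8s API, and the second traceback is the shutdown path failing afterwards.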
   
   ### What you think should happen instead
   
   When this error occurs, the executor should restart automatically, or the 
scheduler should be killed.
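
A minimal sketch of the "scheduler should be killed" option, assuming a watchdog thread that tracks the time of the last successful loop iteration. `HeartbeatWatchdog`, `beat`, and `on_stall` are hypothetical names for illustration and are not part of Airflow's API. In a real deployment `on_stall` could terminate the process (for example with `os._exit(1)`) so that the container exits; here it only sets a flag so the behaviour can be observed.

```python
import threading
import time


class HeartbeatWatchdog:
    """Call on_stall once if beat() is not called within timeout seconds."""

    def __init__(self, timeout, on_stall, poll_interval=0.05):
        self._timeout = timeout
        self._on_stall = on_stall
        self._poll_interval = poll_interval
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def beat(self):
        # Called by the monitored loop after each successful iteration.
        self._last_beat = time.monotonic()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Poll until stopped; fire the callback once when the loop goes quiet.
        while not self._stop.wait(self._poll_interval):
            if time.monotonic() - self._last_beat > self._timeout:
                self._on_stall()
                return


def demo():
    stalled = threading.Event()
    watchdog = HeartbeatWatchdog(timeout=0.5, on_stall=stalled.set)
    watchdog.start()
    for _ in range(3):  # healthy loop: beats arrive well within the timeout
        time.sleep(0.05)
        watchdog.beat()
    healthy = not stalled.is_set()
    fired = stalled.wait(timeout=5.0)  # stop beating to simulate a hung call
    watchdog.stop()
    return healthy, fired
```

The watchdog runs in its own thread, so it can still act when the main loop is blocked inside a network call, which is the situation described in this report.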
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

