cantpitch commented on issue #33402:
URL: https://github.com/apache/airflow/issues/33402#issuecomment-1690123546
FWIW, we are seeing this issue as well in our Azure AKS Airflow cluster. At
some point Airflow stops cleaning up "Completed" tasks. It keeps scheduling new
ones, but they never start running. Periodically clearing out the "Completed"
jobs didn't help. The only workaround that worked for us was restarting the
scheduler every hour.
I was able to see an error similar to the one above (Airflow 2.6.3):
```
[2023-08-23T14:49:51.607+0000] {kubernetes_executor.py:114} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 1077, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 1005, in _update_chunk_length
    line = self._fp.fp.readline() # type: ignore[union-attr]
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 106, in run
    kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 161, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 937, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 1106, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 1077, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 1005, in _update_chunk_length
    line = self._fp.fp.readline() # type: ignore[union-attr]
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 106, in run
    kube_client, self.resource_version, self.scheduler_job_id, self.kube_config
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py", line 161, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 937, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 1106, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
[2023-08-23T14:49:52.127+0000] {kubernetes_executor.py:340} ERROR - Error while health checking kube watcher process for namespace airflow. Process died for unknown reasons
```
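Going by the traceback, the `ProtocolError` raised by the watch stream
propagates out of `KubernetesJobWatcher.run`, and the watcher process then
dies. As a rough illustration only (this is not Airflow's code and has not
been run against a real cluster), the stdlib-only Python sketch below shows a
watch loop that reopens the stream from the last seen resource version after a
connection reset. `watch_with_reconnect`, `make_flaky_stream`, and the
`(resource_version, payload)` event shape are hypothetical names made up for
the example. Real code would presumably need to catch
`urllib3.exceptions.ProtocolError` (the exception in the log above) rather
than the built-in `ConnectionError` used here.

```python
import time
from typing import Callable, Iterator, List, Tuple

# Hypothetical event shape for this sketch: (resource_version, payload).
Event = Tuple[str, str]


def watch_with_reconnect(
    open_stream: Callable[[str], Iterator[Event]],
    handle: Callable[[Event], None],
    resource_version: str = "0",
    max_retries: int = 5,
    backoff_seconds: float = 0.0,
) -> str:
    """Consume a watch stream, reopening it after a dropped connection.

    Returns the last resource version that was handled.
    """
    retries = 0
    while True:
        try:
            for event in open_stream(resource_version):
                resource_version = event[0]  # remember progress for the resume
                handle(event)
                retries = 0  # an event arrived, so the connection is healthy
            return resource_version  # stream ended cleanly
        except ConnectionError:
            # ConnectionResetError is a subclass of ConnectionError.
            retries += 1
            if retries > max_retries:
                raise  # give up and let the caller decide what to do
            time.sleep(backoff_seconds * retries)


def make_flaky_stream(events: List[Event], fail_after: int):
    """Build a hypothetical stream that resets once after `fail_after` events."""
    state = {"failed": False}

    def open_stream(resource_version: str) -> Iterator[Event]:
        sent = 0
        for event in events:
            if int(event[0]) <= int(resource_version):
                continue  # already delivered before the reconnect
            if not state["failed"] and sent == fail_after:
                state["failed"] = True
                raise ConnectionResetError(104, "Connection reset by peer")
            sent += 1
            yield event

    return open_stream


seen: List[Event] = []
events = [("1", "pod-a Succeeded"), ("2", "pod-b Running"), ("3", "pod-b Succeeded")]
last = watch_with_reconnect(make_flaky_stream(events, fail_after=2), seen.append)
```

In this simulated run the stream resets once after two events. The loop
reopens it from the last handled resource version, so the third event is
still delivered and nothing is handled twice. Whether resuming like this is
safe for the real watcher is an open question; it is only meant to show the
shape of the idea.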