bixel commented on issue #36998:
URL: https://github.com/apache/airflow/issues/36998#issuecomment-1959360230
It looks like the scheduler or the kubernetes_executor cannot recover from
communication issues with Kubernetes. I've collected a few hours of logging
after a restart of the scheduler, and the problems seem to start after the
following lines:
```
[2024-02-22T10:24:59.431+0000] {kubernetes_executor_utils.py:121} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1073, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1008, in _update_chunk_length
    raise InvalidChunkLength(self, line) from None
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 933, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1061, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1073, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1008, in _update_chunk_length
    raise InvalidChunkLength(self, line) from None
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 933, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1061, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
[2024-02-22T10:24:59.568+0000] {kubernetes_executor_utils.py:359} ERROR - Error while health checking kube watcher process for namespace airflow. Process died for unknown reasons
[2024-02-22T10:24:59.586+0000] {kubernetes_executor_utils.py:157} INFO - Event: and now my watch begins starting at resource_version: 0
```
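For what it's worth, the behaviour I'd expect from the watcher is roughly the pattern below: reconnect when the HTTP stream breaks and resume from the last seen `resource_version`, instead of letting the process die. This is only a hedged sketch of that generic pattern, not the actual executor code; `make_stream` is a hypothetical stand-in for `kubernetes.watch.Watch().stream(...)`, and I'm catching the built-in `ConnectionError` here for illustration where the real code would also need `urllib3.exceptions.ProtocolError`:

```python
import time


def resilient_watch(make_stream, resource_version="0", backoff=1.0,
                    max_retries=5, retry_on=(ConnectionError,)):
    """Yield events from make_stream(resource_version), reconnecting on
    broken-connection errors and resuming from the last resource_version."""
    retries = 0
    while True:
        try:
            for event in make_stream(resource_version):
                retries = 0  # stream is healthy again, reset the budget
                # remember where we are so a reconnect can resume here
                resource_version = event["object"]["metadata"]["resourceVersion"]
                yield event
            return  # stream ended cleanly
        except retry_on:
            retries += 1
            if retries > max_retries:
                raise  # give up after repeated failures
            time.sleep(backoff * retries)  # simple linear backoff
```

Interestingly, the watcher does restart ("and now my watch begins starting at resource_version: 0"), yet the scheduler still stops processing executor events afterwards, so the restart alone doesn't seem to be enough.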
After that, tasks remain stuck in the queued state and I no longer see any
lines of the kind
```
[2024-02-22T10:16:50.260+0000] {scheduler_job_runner.py:696} INFO - Received executor event with state success for task instance TaskInstanceKey
```
I can only recover from that state by clearing all scheduled and queued tasks
*and* restarting the scheduler. I haven't been able to dig deeper into the
kubernetes_executor yet, but there are quite a few changes between
2.7.3 and 2.8.1, so that would be my first guess for the origin of this.
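In case it helps anyone hitting the same thing, the "clear all scheduled and queued tasks" step can be scripted against the Airflow 2 stable REST API (`POST /api/v1/dags/{dag_id}/clearTaskInstances`); the scheduler restart still has to be done separately. This is a minimal sketch under the assumption that the API server is reachable and you have credentials; the dag id, dates, and auth header are placeholders:

```python
import json
from urllib import request


def build_clear_payload(start_date, end_date, dry_run=True):
    """Request body for clearTaskInstances: clear task instances inside the
    window (dry_run=True only lists what would be cleared)."""
    return {
        "dry_run": dry_run,
        "start_date": start_date,
        "end_date": end_date,
        "include_subdags": False,
        "reset_dag_runs": True,
    }


def clear_dag(base_url, dag_id, payload, auth_header):
    """POST the payload; the response lists the affected task instances."""
    req = request.Request(
        f"{base_url}/api/v1/dags/{dag_id}/clearTaskInstances",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", **auth_header},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical usage (placeholder dag id and auth):
# payload = build_clear_payload("2024-02-22T00:00:00Z", "2024-02-23T00:00:00Z",
#                               dry_run=False)
# clear_dag("http://localhost:8080", "my_dag", payload,
#           {"Authorization": "Basic ..."})
```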
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]