bixel commented on issue #36998:
URL: https://github.com/apache/airflow/issues/36998#issuecomment-1959360230

   It looks like the scheduler or the kubernetes_executor cannot recover from 
communication issues with Kubernetes. I've collected a few hours of logs 
after a restart of the scheduler, and the problems seem to start after the 
following lines:
   
   ```
   [2024-02-22T10:24:59.431+0000] {kubernetes_executor_utils.py:121} ERROR - 
Unknown error in KubernetesJobWatcher. Failing
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
710, in _error_catcher
       yield
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
1073, in read_chunked
       self._update_chunk_length()
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
1008, in _update_chunk_length
       raise InvalidChunkLength(self, line) from None
   urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 
bytes read)
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py",
 line 112, in run
       self.resource_version = self._run(
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py",
 line 168, in _run
       for event in self._pod_events(kube_client=kube_client, 
query_kwargs=kwargs):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", 
line 165, in stream
       for line in iter_resp_lines(resp):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", 
line 56, in iter_resp_lines
       for seg in resp.stream(amt=None, decode_content=False):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
933, in stream
       yield from self.read_chunked(amt, decode_content=decode_content)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
1061, in read_chunked
       with self._error_catcher():
     File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
       self.gen.throw(typ, value, traceback)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
727, in _error_catcher
       raise ProtocolError(f"Connection broken: {e!r}", e) from e
   urllib3.exceptions.ProtocolError: ("Connection broken: 
InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got 
length b'', 0 bytes read))
   Process KubernetesJobWatcher-3:
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
710, in _error_catcher
       yield
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
1073, in read_chunked
       self._update_chunk_length()
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
1008, in _update_chunk_length
       raise InvalidChunkLength(self, line) from None
   urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 
bytes read)
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in 
_bootstrap
       self.run()
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py",
 line 112, in run
       self.resource_version = self._run(
     File 
"/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py",
 line 168, in _run
       for event in self._pod_events(kube_client=kube_client, 
query_kwargs=kwargs):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", 
line 165, in stream
       for line in iter_resp_lines(resp):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/kubernetes/watch/watch.py", 
line 56, in iter_resp_lines
       for seg in resp.stream(amt=None, decode_content=False):
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
933, in stream
       yield from self.read_chunked(amt, decode_content=decode_content)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
1061, in read_chunked
       with self._error_catcher():
     File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
       self.gen.throw(typ, value, traceback)
     File 
"/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 
727, in _error_catcher
       raise ProtocolError(f"Connection broken: {e!r}", e) from e
   urllib3.exceptions.ProtocolError: ("Connection broken: 
InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got 
length b'', 0 bytes read))
   [2024-02-22T10:24:59.568+0000] {kubernetes_executor_utils.py:359} ERROR - 
Error while health checking kube watcher process for namespace airflow. Process 
died for unknown reasons
   [2024-02-22T10:24:59.586+0000] {kubernetes_executor_utils.py:157} INFO - 
Event: and now my watch begins starting at resource_version: 0
   ```
   
   After that, tasks are stuck in the queued state and I no longer see any log 
lines of this kind:
   ```
   [2024-02-22T10:16:50.260+0000] {scheduler_job_runner.py:696} INFO - Received 
executor event with state success for task instance TaskInstanceKey
   ```
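
To check this against a longer scheduler log, a small script along these lines could compare the last "Received executor event" timestamp with the first watcher failure. This is only a sketch: the regular expression assumes the unwrapped `[timestamp] {file:line} LEVEL - message` layout seen in the excerpts above, and `summarize` and `sample_log` are hypothetical names, not part of Airflow.

```python
import re

# Assumed layout of one unwrapped scheduler log line:
# [timestamp] {source_file:line} LEVEL - message
LOG_LINE = re.compile(
    r"^\s*\[(?P<ts>[^\]]+)\] \{(?P<src>[^}]+)\} (?P<level>\w+) - (?P<msg>.*)$"
)


def summarize(lines):
    """Return (timestamp of last executor event, list of watcher failure timestamps)."""
    last_executor_event = None
    watcher_failures = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # traceback lines and other continuation lines are skipped
        if "Received executor event" in match["msg"]:
            last_executor_event = match["ts"]
        elif "Unknown error in KubernetesJobWatcher" in match["msg"]:
            watcher_failures.append(match["ts"])
    return last_executor_event, watcher_failures


# Lines taken from the excerpts in this comment, joined back into single lines.
sample_log = [
    "[2024-02-22T10:16:50.260+0000] {scheduler_job_runner.py:696} INFO - "
    "Received executor event with state success for task instance TaskInstanceKey",
    "[2024-02-22T10:24:59.431+0000] {kubernetes_executor_utils.py:121} ERROR - "
    "Unknown error in KubernetesJobWatcher. Failing",
    "[2024-02-22T10:24:59.586+0000] {kubernetes_executor_utils.py:157} INFO - "
    "Event: and now my watch begins starting at resource_version: 0",
]

last_event, failures = summarize(sample_log)
# Timestamps share one format and UTC offset, so string comparison orders them.
stalled = bool(failures) and last_event is not None and last_event < failures[0]
print(last_event, failures, stalled)
```

If `stalled` is true for a full log, no executor event was received after the first watcher failure, which matches the behaviour described here.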
   
   I can only recover from that state by clearing all scheduled and queued 
tasks *and* restarting the scheduler. I haven't been able to dig deeper into 
the kubernetes_executor yet, but there seem to be quite a few changes between 
2.7.3 and 2.8.1, so that would be my first guess for the origin of this.
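
To make the expected recovery concrete, here is a minimal sketch of a watch loop that reconnects after a broken stream and resumes from the last resource version it saw, rather than failing or starting again at 0. It is not the Airflow or kubernetes client code: `ConnectionBroken` is a hypothetical stand-in for `urllib3.exceptions.ProtocolError`, and `watch_with_retry`, `open_stream`, and `make_flaky_stream` are illustrative names. Whether a missing resume step is actually the cause of the stuck tasks is an assumption I have not verified.

```python
import time
from typing import Callable, Iterable, Iterator


class ConnectionBroken(Exception):
    """Hypothetical stand-in for urllib3.exceptions.ProtocolError."""


def watch_with_retry(
    open_stream: Callable[[str], Iterable[dict]],
    start_version: str = "0",
    max_retries: int = 5,
    backoff_seconds: float = 0.0,
) -> Iterator[dict]:
    """Yield events, reconnecting from the last seen resource version."""
    resource_version = start_version
    failures = 0
    while True:
        try:
            for event in open_stream(resource_version):
                resource_version = event["resource_version"]
                failures = 0  # a delivered event shows the stream works again
                yield event
            return  # the stream ended cleanly
        except ConnectionBroken:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the error to the caller
            time.sleep(backoff_seconds * failures)


def make_flaky_stream():
    """Build a fake event source whose first connection breaks before event 3."""
    calls = []
    events = [{"resource_version": str(i), "pod": f"pod-{i}"} for i in range(1, 5)]

    def open_stream(resource_version):
        calls.append(resource_version)
        for event in events:
            # Integer comparison is a simplification for this fake source only;
            # real Kubernetes resource versions are opaque strings.
            if int(event["resource_version"]) <= int(resource_version):
                continue
            if event["resource_version"] == "3" and len(calls) == 1:
                raise ConnectionBroken("InvalidChunkLength")
            yield event

    return open_stream, calls


open_stream, calls = make_flaky_stream()
seen = [event["pod"] for event in watch_with_retry(open_stream)]
print(seen)
print(calls)
```

In this sketch the second connection is opened with the last resource version that was delivered, so no event is lost or replayed. In the log above, the restarted watcher instead begins at `resource_version: 0`.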


