cantpitch commented on issue #33402:
URL: https://github.com/apache/airflow/issues/33402#issuecomment-1690123546

   FWIW, we are seeing this issue as well in our Azure AKS Airflow cluster. At 
some point the scheduler stops cleaning up "Completed" tasks; it still schedules 
new ones, but they never start running. Periodically clearing out the 
"Completed" jobs didn't help. The only workaround that held up for us was 
restarting the scheduler periodically (every hour). 
   
   I was able to see a similar error to the above (Airflow 2.6.3):
   
   ```
   [2023-08-23T14:49:51.607+0000] {kubernetes_executor.py:114} ERROR - Unknown 
error in KubernetesJobWatcher. Failing
    Traceback (most recent call last):
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
710, in _error_catcher
        yield
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
1077, in read_chunked
        self._update_chunk_length()
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
1005, in _update_chunk_length
        line = self._fp.fp.readline()  # type: ignore[union-attr]
      File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
        return self._sock.recv_into(b)
      File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
        return self.read(nbytes, buffer)
      File "/usr/local/lib/python3.7/ssl.py", line 929, in read
        return self._sslobj.read(len, buffer)
    ConnectionResetError: [Errno 104] Connection reset by peer
   
    The above exception was the direct cause of the following exception:
   
    Traceback (most recent call last):
      File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py",
 line 106, in run
        kube_client, self.resource_version, self.scheduler_job_id, 
self.kube_config
      File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py",
 line 161, in _run
        for event in self._pod_events(kube_client=kube_client, 
query_kwargs=kwargs):
      File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/watch/watch.py", 
line 165, in stream
        for line in iter_resp_lines(resp):
      File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/watch/watch.py", 
line 56, in iter_resp_lines
        for seg in resp.stream(amt=None, decode_content=False):
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
937, in stream
        yield from self.read_chunked(amt, decode_content=decode_content)
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
1106, in read_chunked
        self._original_response.close()
      File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
        self.gen.throw(type, value, traceback)
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
727, in _error_catcher
        raise ProtocolError(f"Connection broken: {e!r}", e) from e
    urllib3.exceptions.ProtocolError: ("Connection broken: 
ConnectionResetError(104, 'Connection reset by peer')", 
ConnectionResetError(104, 'Connection reset by peer'))
    Process KubernetesJobWatcher-3:
    Traceback (most recent call last):
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
710, in _error_catcher
        yield
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
1077, in read_chunked
        self._update_chunk_length()
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
1005, in _update_chunk_length
        line = self._fp.fp.readline()  # type: ignore[union-attr]
      File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
        return self._sock.recv_into(b)
      File "/usr/local/lib/python3.7/ssl.py", line 1071, in recv_into
        return self.read(nbytes, buffer)
      File "/usr/local/lib/python3.7/ssl.py", line 929, in read
        return self._sslobj.read(len, buffer)
    ConnectionResetError: [Errno 104] Connection reset by peer
   
    The above exception was the direct cause of the following exception:
   
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in 
_bootstrap
        self.run()
      File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py",
 line 106, in run
        kube_client, self.resource_version, self.scheduler_job_id, 
self.kube_config
      File 
"/home/airflow/.local/lib/python3.7/site-packages/airflow/executors/kubernetes_executor.py",
 line 161, in _run
        for event in self._pod_events(kube_client=kube_client, 
query_kwargs=kwargs):
      File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/watch/watch.py", 
line 165, in stream
        for line in iter_resp_lines(resp):
      File 
"/home/airflow/.local/lib/python3.7/site-packages/kubernetes/watch/watch.py", 
line 56, in iter_resp_lines
        for seg in resp.stream(amt=None, decode_content=False):
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
937, in stream
        yield from self.read_chunked(amt, decode_content=decode_content)
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
1106, in read_chunked
        self._original_response.close()
      File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
        self.gen.throw(type, value, traceback)
      File 
"/home/airflow/.local/lib/python3.7/site-packages/urllib3/response.py", line 
727, in _error_catcher
        raise ProtocolError(f"Connection broken: {e!r}", e) from e
    urllib3.exceptions.ProtocolError: ("Connection broken: 
ConnectionResetError(104, 'Connection reset by peer')", 
ConnectionResetError(104, 'Connection reset by peer'))
    [2023-08-23T14:49:52.127+0000] {kubernetes_executor.py:340} ERROR - Error 
while health checking kube watcher process for namespace airflow. Process died 
for unknown reasons  
   ```
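
For illustration only: the traceback shows the watch stream raising on a connection reset and the watcher process dying with it. Below is a minimal sketch of the general pattern of reopening a stream after a connection error instead of letting the consumer die. It is not Airflow's code or its fix; `resilient_stream`, `open_stream`, `retry_on`, `max_retries` and `backoff_seconds` are hypothetical names made up for this example. It retries on the built-in `ConnectionError` by default; the error in the log above is `urllib3.exceptions.ProtocolError`, which is not a `ConnectionError` subclass, so it would have to be passed in `retry_on` explicitly.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


def resilient_stream(
    open_stream: Callable[[], Iterable[T]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_retries: int = 5,
    backoff_seconds: float = 1.0,
) -> Iterator[T]:
    """Yield items from open_stream(), reopening it after a retryable error.

    Hypothetical helper: open_stream is any zero-argument callable that
    returns a fresh iterable of events each time it is called.
    """
    failures = 0
    while True:
        try:
            for item in open_stream():
                # A delivered event means the connection works again,
                # so reset the consecutive-failure counter.
                failures = 0
                yield item
            return  # the stream ended normally, nothing left to retry
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the last error to the caller
            # Linear backoff before reopening the stream.
            time.sleep(backoff_seconds * failures)
```

A real watcher would also need to resume from the last seen resource version when it reopens the stream (the `run` frame above passes `self.resource_version`); the sketch leaves that out and simply starts a fresh stream.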


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

