karunpoudel-chr commented on issue #36998:
URL: https://github.com/apache/airflow/issues/36998#issuecomment-2066762929

   I am seeing this issue in a single namespace.
   airflow==2.8.4
   apache-airflow-providers-cncf-kubernetes==7.14.0
   kubernetes==23.6.0
   
   `KubernetesJobWatcher` failed a couple of times but was able to restart.
   In the logs below, the watcher running as PID 2034 failed. On the next executor sync, it was restarted as PID 3740.
   
   ```
   [2024-04-18T23:29:34.285+0000] [2034:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:121} ERROR - Unknown error in KubernetesJobWatcher. Failing
   Traceback (most recent call last):
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 710, in _error_catcher
       yield
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1073, in read_chunked
       self._update_chunk_length()
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1008, in _update_chunk_length
       raise InvalidChunkLength(self, line) from None
   urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

   The above exception was the direct cause of the following exception:

   Traceback (most recent call last):
     File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
       self.resource_version = self._run(
                               ^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
       for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
     File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 165, in stream
       for line in iter_resp_lines(resp):
     File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
       for seg in resp.stream(amt=None, decode_content=False):
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 933, in stream
       yield from self.read_chunked(amt, decode_content=decode_content)
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1061, in read_chunked
       with self._error_catcher():
     File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
       self.gen.throw(typ, value, traceback)
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 727, in _error_catcher
       raise ProtocolError(f"Connection broken: {e!r}", e) from e
   urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
   Process KubernetesJobWatcher-5:
   Traceback (most recent call last):
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 710, in _error_catcher
       yield
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1073, in read_chunked
       self._update_chunk_length()
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1008, in _update_chunk_length
       raise InvalidChunkLength(self, line) from None
   urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

   The above exception was the direct cause of the following exception:

   Traceback (most recent call last):
     File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
       self.run()
     File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
       self.resource_version = self._run(
                               ^^^^^^^^^^
     File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
       for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
     File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 165, in stream
       for line in iter_resp_lines(resp):
     File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
       for seg in resp.stream(amt=None, decode_content=False):
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 933, in stream
       yield from self.read_chunked(amt, decode_content=decode_content)
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1061, in read_chunked
       with self._error_catcher():
     File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
       self.gen.throw(typ, value, traceback)
     File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 727, in _error_catcher
       raise ProtocolError(f"Connection broken: {e!r}", e) from e
   urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
   [2024-04-18T23:29:35.067+0000] [ 7:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:449} DEBUG - Syncing KubernetesExecutor
   [2024-04-18T23:29:35.067+0000] [ 7:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:359} ERROR - Error while health checking kube watcher process for namespace astronomer-contractual-po-prod. Process died for unknown reasons
   [2024-04-18T23:29:35.078+0000] [3740:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:157} INFO - Event: and now my watch begins starting at resource_version: 0
   ```
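   For context on the restart path visible above: on each executor sync the scheduler appears to health-check the watcher process and, if it finds it dead, spawns a new one starting from `resource_version: 0`. A minimal sketch of that pattern (simplified and illustrative only; `JobWatcher` and `health_check` are stand-in names, not Airflow's actual classes):
   ```python
   import multiprocessing
   import time


   class JobWatcher(multiprocessing.Process):
       """Stand-in for KubernetesJobWatcher: streams pod events until it dies."""

       def __init__(self, resource_version: str = "0"):
           super().__init__()
           self.resource_version = resource_version

       def run(self):
           # The real watcher streams pod events here; an unhandled
           # ProtocolError ends run() and leaves the process dead.
           time.sleep(5)


   def health_check(watcher: JobWatcher) -> JobWatcher:
       """Called on each executor sync: restart the watcher if the process died."""
       if not watcher.is_alive():
           print("watcher process died, restarting from resource_version 0")
           watcher = JobWatcher(resource_version="0")
           watcher.start()
       return watcher


   if __name__ == "__main__":
       w = JobWatcher()
       w.start()
       for _ in range(3):        # stands in for the executor's periodic sync
           time.sleep(10)
           w = health_check(w)
   ```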
   
   It was working fine for a minute; it was reporting pod changes back.
   
   ```
   [2024-04-18T23:30:50.605+0000] [3740:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:170} DEBUG - Event: strategizer-monitor-check-strategizer-bid-statuses-lz2ww1cv had an event of type DELETED
   [2024-04-18T23:30:50.605+0000] [3740:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:256} INFO - Skipping event for Succeeded pod strategizer-monitor-check-strategizer-bid-statuses-lz2ww1cv - event for this pod already sent to executor
   ```
   
   After this the watcher went silent; there were no more logs from PID 3740. The `KubernetesExecutor.running` set kept increasing:
   ```
   [2024-04-18T23:40:01.059+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:217} DEBUG - 1 running task instances
   [2024-04-18T23:40:01.060+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:218} DEBUG - 0 in queue
   [2024-04-18T23:40:01.060+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:219} DEBUG - 31 open slots
   ...
   [2024-04-19T13:24:44.721+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:217} DEBUG - 32 running task instances
   [2024-04-19T13:24:44.721+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:218} DEBUG - 0 in queue
   [2024-04-19T13:24:44.721+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:219} DEBUG - 0 open slots
   ```
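   For what it's worth, the numbers line up with the usual BaseExecutor slot accounting (open slots = parallelism - running - queued). A tiny illustrative calculation with numbers taken from the logs above (just arithmetic, not Airflow code):
   ```python
   # Illustrative only: the slot accounting implied by the log lines above.
   parallelism = 32
   running = 32   # pods have finished, but their events were never received,
                  # so the executor never removes them from its running set
   queued = 0

   open_slots = parallelism - running - queued
   print(open_slots)  # 0 -> the executor can no longer run anything
   ```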
   I confirmed that PID 3740 is still running:
   ```
   /usr/local/airflow$ ps -ef
   UID   PID PPID  C STIME     TIME CMD
   astro   1    0  0 Apr18 00:00:00 tini -- /entrypoint bash -c exec airflow scheduler
   astro   7    1  4 Apr18 00:37:17 /usr/local/bin/python /usr/local/bin/airflow scheduler
   astro  24    7  0 Apr18 00:01:06 /usr/local/bin/python /usr/local/bin/airflow scheduler
   astro  33    7  0 Apr18 00:00:41 /usr/local/bin/python /usr/local/bin/airflow scheduler
   astro  44    7  1 Apr18 00:12:28 airflow scheduler -- DagFileProcessorManager
   astro 740    7  0 Apr18 00:00:00 /usr/local/bin/python /usr/local/bin/airflow scheduler  <=== KubernetesJobWatcher
   ```
   So the watcher may not be receiving any pod events back from Kubernetes, even though pods are still being created and completing.
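   One way to check that hypothesis independently of Airflow is to run a standalone watch against the same namespace with explicit timeouts, so a silently dead connection surfaces instead of blocking forever. A rough diagnostic sketch (the timeout values are arbitrary assumptions on my part):
   ```python
   from kubernetes import client, config, watch

   config.load_incluster_config()  # or config.load_kube_config() when running locally
   v1 = client.CoreV1Api()

   w = watch.Watch()
   while True:
       # timeout_seconds: the API server ends the watch after this long;
       # _request_timeout: client-side socket timeout, so a connection that
       # stops delivering chunks raises instead of hanging indefinitely.
       for event in w.stream(
           v1.list_namespaced_pod,
           namespace="astronomer-contractual-po-prod",
           timeout_seconds=300,
           _request_timeout=330,
       ):
           pod = event["object"]
           print(event["type"], pod.metadata.name, pod.status.phase)
       # Stream ended (timeout or dropped connection): re-establish the watch.
   ```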

