karunpoudel-chr commented on issue #36998: URL: https://github.com/apache/airflow/issues/36998#issuecomment-2066762929
I am seeing this issue in a single namespace, with airflow==2.8.4, apache-airflow-providers-cncf-kubernetes==7.14.0, and kubernetes==23.6.0. `KubernetesJobWatcher` failed a couple of times, but it was able to restart. In the logs below, the watcher running as PID 2034 failed; on the next sync of the executor, it was restarted as PID 3740.
```
[2024-04-18T23:29:34.285+0000] [2034:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:121} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1073, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1008, in _update_chunk_length
    raise InvalidChunkLength(self, line) from None
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
                            ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 933, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1061, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Process KubernetesJobWatcher-5:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1073, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1008, in _update_chunk_length
    raise InvalidChunkLength(self, line) from None
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
                            ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 933, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1061, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
[2024-04-18T23:29:35.067+0000] [ 7:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:449} DEBUG - Syncing KubernetesExecutor
[2024-04-18T23:29:35.067+0000] [ 7:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:359} ERROR - Error while health checking kube watcher process for namespace astronomer-contractual-po-prod. Process died for unknown reasons
[2024-04-18T23:29:35.078+0000] [3740:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:157} INFO - Event: and now my watch begins starting at resource_version: 0
```
It was working fine for a minute; it was reporting pod changes back:
```
[2024-04-18T23:30:50.605+0000] [3740:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:170} DEBUG - Event: strategizer-monitor-check-strategizer-bid-statuses-lz2ww1cv had an event of type DELETED
[2024-04-18T23:30:50.605+0000] [3740:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:256} INFO - Skipping event for Succeeded pod strategizer-monitor-check-strategizer-bid-statuses-lz2ww1cv - event for this pod already sent to executor
```
After this the watcher went silent; there were no further logs from PID 3740.
The `KubernetesExecutor.running` set kept increasing:
```
[2024-04-18T23:40:01.059+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:217} DEBUG - 1 running task instances
[2024-04-18T23:40:01.060+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:218} DEBUG - 0 in queue
[2024-04-18T23:40:01.060+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:219} DEBUG - 31 open slots
...
[2024-04-19T13:24:44.721+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:217} DEBUG - 32 running task instances
[2024-04-19T13:24:44.721+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:218} DEBUG - 0 in queue
[2024-04-19T13:24:44.721+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:219} DEBUG - 0 open slots
```
I confirmed that the watcher process is still running:
```
/usr/local/airflow$ ps -ef
UID   PID  PPID C STIME TIME     CMD
astro   1     0 0 Apr18 00:00:00 tini -- /entrypoint bash -c exec airflow scheduler
astro   7     1 4 Apr18 00:37:17 /usr/local/bin/python /usr/local/bin/airflow scheduler
astro  24     7 0 Apr18 00:01:06 /usr/local/bin/python /usr/local/bin/airflow scheduler
astro  33     7 0 Apr18 00:00:41 /usr/local/bin/python /usr/local/bin/airflow scheduler
astro  44     7 1 Apr18 00:12:28 airflow scheduler -- DagFileProcessorManager
astro 740     7 0 Apr18 00:00:00 /usr/local/bin/python /usr/local/bin/airflow scheduler   <=== KubernetesJobWatcher
```
So, it may not be getting any pod events back from Kubernetes, even though pods are being created and completed.
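One possible explanation (an assumption on my part, not verified against this deployment) is that the watch's HTTP connection died without a FIN/RST, so the event stream blocks on a read forever, which would match the silence from the restarted watcher. A common mitigation for long-lived watches is a client-side read timeout, e.g. passing `_request_timeout` through the Kubernetes Python client's watch call, so that a dead connection surfaces as an error the executor's health check can recover from. A minimal socket-level sketch of the idea (toy code, not the actual watcher):

```python
import socket

# Toy model of a watch stream whose server has gone silent, analogous to
# the KubernetesJobWatcher above receiving no further pod events.
server, client = socket.socketpair()

# Client-side read timeout; without it, the second recv() below would
# block forever instead of raising.
client.settimeout(0.5)

server.sendall(b"event-1\n")   # one event is delivered normally
first = client.recv(64)
print(first)                   # b'event-1\n'

timed_out = False
try:
    client.recv(64)            # the server never sends another event
except socket.timeout:
    timed_out = True
    print("no event within timeout - tear down and restart the watch")

server.close()
client.close()
```

With a timeout in place, the silent-hang case becomes an exception the existing restart path (the one that logged "Process died for unknown reasons") could act on, instead of a live process that never emits events.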