karunpoudel-chr commented on issue #36998:
URL: https://github.com/apache/airflow/issues/36998#issuecomment-2066762929
I am seeing this issue in a single namespace.
airflow==2.8.4
apache-airflow-providers-cncf-kubernetes==7.14.0
kubernetes==23.6.0
`KubernetesJobWatcher` failed a couple of times but was able to restart. In the logs below, the watcher running as PID 2034 failed. On the next sync of the executor, it was started again as PID 3740.
```
[2024-04-18T23:29:34.285+0000] [2034:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:121} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1073, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1008, in _update_chunk_length
    raise InvalidChunkLength(self, line) from None
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
                            ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 933, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1061, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Process KubernetesJobWatcher-5:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 710, in _error_catcher
    yield
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1073, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1008, in _update_chunk_length
    raise InvalidChunkLength(self, line) from None
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
                            ^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 933, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 1061, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.11/site-packages/urllib3/response.py", line 727, in _error_catcher
    raise ProtocolError(f"Connection broken: {e!r}", e) from e
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
[2024-04-18T23:29:35.067+0000] [ 7:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:449} DEBUG - Syncing KubernetesExecutor
[2024-04-18T23:29:35.067+0000] [ 7:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:359} ERROR - Error while health checking kube watcher process for namespace astronomer-contractual-po-prod. Process died for unknown reasons
[2024-04-18T23:29:35.078+0000] [3740:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:157} INFO - Event: and now my watch begins starting at resource_version: 0
```
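For what it's worth, my read of the restart above is that the executor's periodic sync health-checks the watcher child process and spawns a replacement once it notices the process has died, which matches the "Process died for unknown reasons" line followed by the new PID. Roughly this shape (a simplified sketch, not the provider's actual code; `watch_pods` and `health_check_and_restart` are made-up names):

```python
import multiprocessing
import time


def watch_pods(namespace: str, resource_version: str) -> None:
    """Stand-in for the watcher loop that streams pod events (sketch only)."""
    while True:
        time.sleep(1)


def health_check_and_restart(watcher, namespace: str, resource_version: str):
    # If the watcher process has exited, log it and start a fresh one,
    # mirroring the "Process died for unknown reasons" / new-PID sequence above.
    if watcher is not None and not watcher.is_alive():
        print(f"watcher for {namespace} died (exitcode={watcher.exitcode}), restarting")
        watcher = None
    if watcher is None:
        watcher = multiprocessing.Process(
            target=watch_pods, args=(namespace, resource_version), daemon=True
        )
        watcher.start()
        print(f"started watcher PID {watcher.pid} at resource_version {resource_version}")
    return watcher


# Called from each executor sync iteration (sketch):
# watcher = health_check_and_restart(watcher, "my-namespace", "0")
```

So the restart path itself seems healthy; the problem below is that the new process stays alive but stops producing events.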
It was working fine for a minute; it was reporting pod changes back.
```
[2024-04-18T23:30:50.605+0000] [3740:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:170} DEBUG - Event: strategizer-monitor-check-strategizer-bid-statuses-lz2ww1cv had an event of type DELETED
[2024-04-18T23:30:50.605+0000] [3740:139691425343296] {airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py:256} INFO - Skipping event for Succeeded pod strategizer-monitor-check-strategizer-bid-statuses-lz2ww1cv - event for this pod already sent to executor
```
After this the watcher went silent; there were no more logs from PID 3740. The `KubernetesExecutor.running` set kept increasing:
```
[2024-04-18T23:40:01.059+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:217} DEBUG - 1 running task instances
[2024-04-18T23:40:01.060+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:218} DEBUG - 0 in queue
[2024-04-18T23:40:01.060+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:219} DEBUG - 31 open slots
...
[2024-04-19T13:24:44.721+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:217} DEBUG - 32 running task instances
[2024-04-19T13:24:44.721+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:218} DEBUG - 0 in queue
[2024-04-19T13:24:44.721+0000] [ 7:139691425343296] {airflow/executors/base_executor.py:219} DEBUG - 0 open slots
```
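The slot math tracks that: open slots appear to be just parallelism minus the size of the running set, so if the watcher never reports completions, the running set only grows until the executor starves. A quick illustration of that reading (my own sketch, not the `base_executor` code; `PARALLELISM` is inferred from the "31 open slots" line):

```python
PARALLELISM = 32  # assumed core.parallelism for this deployment, inferred from "31 open slots"


def open_slots(running: int, parallelism: int = PARALLELISM) -> int:
    # Slots only free up when task instances leave the running set, which in
    # turn requires the KubernetesJobWatcher to report pod completion events.
    return parallelism - running


print(open_slots(running=1))   # 31 -> matches the 23:40:01 log lines
print(open_slots(running=32))  # 0  -> matches the 13:24:44 log lines the next day
```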
I confirmed that PID 3740 is still running.
```
/usr/local/airflow$ ps -ef
UID        PID  PPID  C STIME     TIME CMD
astro        1     0  0 Apr18 00:00:00 tini -- /entrypoint bash -c exec airflow scheduler
astro        7     1  4 Apr18 00:37:17 /usr/local/bin/python /usr/local/bin/airflow scheduler
astro       24     7  0 Apr18 00:01:06 /usr/local/bin/python /usr/local/bin/airflow scheduler
astro       33     7  0 Apr18 00:00:41 /usr/local/bin/python /usr/local/bin/airflow scheduler
astro       44     7  1 Apr18 00:12:28 airflow scheduler -- DagFileProcessorManager
astro      740     7  0 Apr18 00:00:00 /usr/local/bin/python /usr/local/bin/airflow scheduler   <=== KubernetesJobWatcher
```
So the watcher process is alive but may not be getting any pod events back from Kubernetes, even though pods are being created and completed.
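To test that theory, one option is to run a standalone watch against the same namespace and see whether events still arrive while the executor's watcher is silent. Something along these lines (a rough sketch with the kubernetes Python client; the namespace is taken from the log above and the label selector is an assumption about how the provider labels worker pods):

```python
from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
# Namespace from the scheduler log; the label selector is an assumption.
for event in w.stream(
    v1.list_namespaced_pod,
    namespace="astronomer-contractual-po-prod",
    label_selector="kubernetes_executor=True",
    timeout_seconds=300,
):
    pod = event["object"]
    print(event["type"], pod.metadata.name, pod.status.phase)
```

If this standalone watch keeps printing ADDED/MODIFIED/DELETED events while PID 3740 logs nothing, that would point at the watcher's own HTTP stream being silently dead rather than the API server withholding events.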