zappallot commented on issue #12136:
URL: https://github.com/apache/airflow/issues/12136#issuecomment-956227803
Hi there,
we are seeing the same problem with Airflow 2.1.3 on Kubernetes 1.20.7, running on EKS in AWS.
The code that surfaced the problem:
```
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

t1 = KubernetesPodOperator(
    task_id='bash-image',
    name='bash-image',
    image='bash:latest',
    cmds=[
        "bash",
        "-c",
        "while true; do echo 'hello bash'; sleep 10; done;",
    ],
    resources={
        'request_cpu': "1m", 'limit_cpu': "50m",
        'request_memory': "8M", 'limit_memory': "32M",
    },
)
```
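As an aside, newer versions of the cncf.kubernetes provider deprecate the plain-dict form of `resources` in favour of the kubernetes client model. A sketch of the equivalent object, assuming the `kubernetes` Python client is available (untested here):

```
from kubernetes.client import models as k8s

# Sketch: the same requests/limits as the dict above, as a k8s model.
resources = k8s.V1ResourceRequirements(
    requests={"cpu": "1m", "memory": "8M"},
    limits={"cpu": "50m", "memory": "32M"},
)
```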
The error message when the task eventually fails:
```
[2021-10-28 09:48:12,961] {pod_launcher.py:149} INFO - hello bash
...
[2021-10-29 19:01:49,792] {pod_launcher.py:149} INFO - hello bash
[2021-10-29 19:01:52,639] {taskinstance.py:1462} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 697, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
    yield
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 764, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 701, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
    result = task_copy.execute(context=context)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 367, in execute
    final_state, remote_pod, result = self.create_new_pod_for_operator(labels, launcher)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 521, in create_new_pod_for_operator
    final_state, remote_pod, result = launcher.monitor_pod(pod=self.pod, get_logs=self.get_logs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/utils/pod_launcher.py", line 147, in monitor_pod
    for line in logs:
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 808, in __iter__
    for chunk in self.stream(decode_content=True):
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 572, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
[2021-10-29 19:01:52,642] {taskinstance.py:1505} INFO - Marking task as FAILED. dag_id=log_collection_tester-dv01, task_id=bash-image, execution_date=20211028T094738, start_date=20211028T094801, end_date=20211029T190152
```
The traceback shows the chunked HTTP log stream from the Kubernetes API server being cut off while `pod_launcher.monitor_pod` iterates over the pod's logs. Around the same time this error comes up, we detected **LeaderElection** events from four different Kubernetes control-plane services in EKS, which are a black box for us:
- Lease kube-system/kube-scheduler (2021-10-29 19:01:58 +0000 UTC)
- Lease kube-system/kube-controller-manager (2021-10-29 19:01:59 +0000 UTC)
- ConfigMap kube-system/eks-certificates-controller (2021-10-29 19:02:11 +0000 UTC)
- ConfigMap kube-system/cp-vpc-resource-controller (2021-10-29 19:11:29 +0000 UTC)
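For reference, a minimal sketch of how such events can be polled with the `kubernetes` Python client; the `reason=LeaderElection` field selector is an assumption about how these events are labelled, so adjust it to whatever your cluster actually emits:

```
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Assumption: leader-election events carry reason=LeaderElection.
events = v1.list_namespaced_event(
    namespace="kube-system",
    field_selector="reason=LeaderElection",
)
for e in events.items:
    print(e.last_timestamp, e.involved_object.kind, e.involved_object.name, e.message)
```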
These events span between 10 and 25 minutes, and restarting our processes during that window only results in the same error again. So it looks to me like some maintenance on the Kubernetes control-plane servers causes this problem on our Airflow system; the longer a task runs, the more likely it is to hit this error.
This is quite an annoying error: because restarting the control-plane servers takes so long, you cannot easily work around it with a simple restart. We have caught this error in both our TEST and PROD environments (we run two clusters).
Does anyone have an idea how to fix this, or what a workaround could be, other than disabling the logging for all tasks?
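One mitigation we are considering is reading the pod logs ourselves with a reconnect loop instead of relying on a single streamed connection. Below is an untested sketch, not the operator's actual API: `stream_pod_logs_with_retry` is a hypothetical helper, and the reconnect policy (and the possibility of replaying a few duplicate lines after a reconnect) is an assumption:

```
import time

from kubernetes import client, config
from urllib3.exceptions import ProtocolError

def stream_pod_logs_with_retry(name, namespace, container=None, max_reconnects=10):
    """Follow pod logs, reconnecting when the chunked HTTP stream breaks."""
    config.load_incluster_config()  # use load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    since = None
    for _ in range(max_reconnects):
        start = time.time()
        try:
            resp = v1.read_namespaced_pod_log(
                name=name,
                namespace=namespace,
                container=container,
                follow=True,
                since_seconds=since,
                _preload_content=False,  # stream instead of buffering everything
            )
            for line in resp:  # the same iteration that breaks in pod_launcher.py
                yield line
            return  # stream closed cleanly: the container has finished
        except ProtocolError:
            # The connection was cut (e.g. during a control-plane restart).
            # Re-attach and replay roughly from where we left off; this may
            # emit a few duplicate lines, which we prefer to a failed task.
            since = max(int(time.time() - start) + 1, 1)
            time.sleep(5)
    raise RuntimeError("log stream kept breaking; giving up")
```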