zappallot commented on issue #12136:
URL: https://github.com/apache/airflow/issues/12136#issuecomment-956227803


   Hi there,
   
   We are seeing the same problem with Airflow 2.1.3 on Kubernetes 1.20.7, running on EKS in AWS.
   This is the code we used to reproduce it:
   ```
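   from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

   # Long-running pod that prints one log line every 10 seconds, forever.
   t1 = KubernetesPodOperator(
       task_id='bash-image',
       name='bash-image',
       image='bash:latest',
       cmds=[
           "bash",
           "-c",
           "while true; do echo 'hello bash'; sleep 10; done;"
       ],
       # Deliberately small requests and limits; the workload is trivial.
       resources={'request_cpu': "1m", 'limit_cpu': "50m",
                  'request_memory': "8M", 'limit_memory': "32M"}
   )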
   ```
   This is the error when the task fails:
   ```
   [2021-10-28 09:48:12,961] {pod_launcher.py:149} INFO - hello bash
   ...
   [2021-10-29 19:01:49,792] {pod_launcher.py:149} INFO - hello bash
   [2021-10-29 19:01:52,639] {taskinstance.py:1462} ERROR - Task failed with exception
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 697, in _update_chunk_length
       self.chunk_left = int(line, 16)
   ValueError: invalid literal for int() with base 16: b''
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
       yield
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 764, in read_chunked
       self._update_chunk_length()
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 701, in _update_chunk_length
       raise InvalidChunkLength(self, line)
   urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1164, in _run_raw_task
       self._prepare_and_execute_task_with_callbacks(context, task)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1282, in _prepare_and_execute_task_with_callbacks
       result = self._execute_task(context, task_copy)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1312, in _execute_task
       result = task_copy.execute(context=context)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 367, in execute
       final_state, remote_pod, result = self.create_new_pod_for_operator(labels, launcher)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 521, in create_new_pod_for_operator
       final_state, remote_pod, result = launcher.monitor_pod(pod=self.pod, get_logs=self.get_logs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/utils/pod_launcher.py", line 147, in monitor_pod
       for line in logs:
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 808, in __iter__
       for chunk in self.stream(decode_content=True):
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 572, in stream
       for line in self.read_chunked(amt, decode_content=decode_content):
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 793, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
       self.gen.throw(type, value, traceback)
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
       raise ProtocolError("Connection broken: %r" % e, e)
   urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
   [2021-10-29 19:01:52,642] {taskinstance.py:1505} INFO - Marking task as FAILED. dag_id=log_collection_tester-dv01, task_id=bash-image, execution_date=20211028T094738, start_date=20211028T094801, end_date=20211029T190152
   ```
   
   Around the time this error occurs, we see **LeaderElection** events from four different Kubernetes control plane services in EKS, which is a black box for us:
   - Events from the Lease kube-system/kube-scheduler (2021-10-29 19:01:58 +0000 UTC)
   - Events from the Lease kube-system/kube-controller-manager (2021-10-29 19:01:59 +0000 UTC)
   - Events from the ConfigMap kube-system/eks-certificates-controller (2021-10-29 19:02:11 +0000 UTC)
   - Events from the ConfigMap kube-system/cp-vpc-resource-controller (2021-10-29 19:11:29 +0000 UTC)
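
To make the timing explicit, here is a small sketch that computes how many seconds after the task failure each event was recorded. It assumes the Airflow log timestamps are in UTC like the event timestamps (not confirmed above) and truncates the failure time to whole seconds; `seconds_after_failure` is just an illustrative helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Failure time taken from the task log (milliseconds dropped).
# Assumption: the Airflow log uses UTC, same as the event timestamps.
task_failed = datetime.strptime("2021-10-29 19:01:52", FMT)

# Event timestamps copied from the list above.
lease_events = {
    "kube-system/kube-scheduler": "2021-10-29 19:01:58",
    "kube-system/kube-controller-manager": "2021-10-29 19:01:59",
    "kube-system/eks-certificates-controller": "2021-10-29 19:02:11",
    "kube-system/cp-vpc-resource-controller": "2021-10-29 19:11:29",
}


def seconds_after_failure(events, failed_at):
    """Return {event name: whole seconds between failed_at and the event}."""
    return {
        name: int((datetime.strptime(ts, FMT) - failed_at).total_seconds())
        for name, ts in events.items()
    }


offsets = seconds_after_failure(lease_events, task_failed)
for name, secs in offsets.items():
    print(f"{name}: +{secs}s")
```

Under that assumption, the first three events fall within 20 seconds of the failure and the last one a little under ten minutes later.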
   
   These events last between 10 and 25 minutes, and restarting processes during that window only results in another error like the one above.
   
   So it looks to me like maintenance on the Kubernetes control plane servers is causing this problem in our Airflow system. The longer a task runs, the more likely it is to hit this error.
   This is quite an annoying error: because the control plane servers take so long to restart, you cannot easily work around it with a simple restart. We have seen this error in both our TEST and PROD environments (we have two clusters).
   Does anyone have an idea how to fix this, or what a workaround could be that does not involve disabling logging for all tasks?
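
One direction I could imagine for a workaround is to reopen the log stream when the connection breaks, instead of letting the exception fail the task. The sketch below only illustrates that idea in plain Python and is not how Airflow's `pod_launcher` works. `stream_with_reconnect`, `open_stream` and the other parameter names are hypothetical; in the situation above, `retry_on` would presumably be `urllib3.exceptions.ProtocolError`, the exception raised in the traceback.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, Type


def stream_with_reconnect(
    open_stream: Callable[[], Iterable[str]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_reconnects: int = 5,
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield lines from open_stream(), reopening it after a broken connection.

    open_stream is a hypothetical callable that returns a fresh iterable of
    log lines each time it is called. Exceptions not listed in retry_on
    propagate immediately.
    """
    reconnects = 0
    while True:
        try:
            for line in open_stream():
                yield line
            return  # the stream ended normally, so stop
        except retry_on:
            reconnects += 1
            if reconnects > max_reconnects:
                raise  # give up and surface the last error
            sleep(wait_seconds)  # wait before reopening the stream
```

With control plane interruptions of 10 to 25 minutes, `max_reconnects * wait_seconds` would have to cover that whole window. Note that a naive reopen may replay log lines that were already read, so a real implementation would need some way to resume from the last line seen.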
   

