[
https://issues.apache.org/jira/browse/AIRFLOW-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023320#comment-17023320
]
William Lachance commented on AIRFLOW-5571:
-------------------------------------------
I am 99% certain this is a duplicate of AIRFLOW-3534, but IMO the bug is a
little more clearly described there (I think the problem is really that we're
not timing out correctly for this case, which I'm about to file a PR for). I'd
personally recommend closing this one.
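For reference, a rough sketch of the direction I mean by "timing out correctly"
(the parameter choices below are illustrative only, not the actual PR): give the
streaming log request a read timeout so a dead connection raises promptly and
can be handled, instead of hanging or surfacing as a task failure.
{code:python}
# Illustrative only -- not the actual patch. Pod name, namespace and timeout
# values are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
core_v1 = client.CoreV1Api()

logs = core_v1.read_namespaced_pod_log(
    name="example-pod",          # placeholder pod name
    namespace="default",
    follow=True,
    _preload_content=False,      # stream the response instead of buffering it
    _request_timeout=(10, 60),   # (connect, read) timeouts in seconds
)
{code}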
> Kubernetes operator bug: getting logs can make the pod exit unexpectedly
> ------------------------------------------------------------------------
>
> Key: AIRFLOW-5571
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5571
> Project: Apache Airflow
> Issue Type: Bug
> Components: operators
> Affects Versions: 1.10.0, 1.10.1, 1.10.2, 1.10.3
> Reporter: Liu Xuesi
> Priority: Critical
> Labels: kubernetes, operator
> Attachments: k8s_error_log.png
>
>
> In my project, I manage 200+ jobs in Kubernetes, and while working on them I
> found a critical bug in the Kubernetes operator.
> In pod_launcher.py, in the *_monitor_pod* function:
> {code:python}
> if get_logs:
>     logs = self.read_pod_logs(pod)  # This call is wrapped in a retry.
>     for line in logs:               # But the exception is thrown from here!
>         self.log.info(line)
> {code}
> In the code above, *logs* is an HttpResponse; because it implements
> __iter__(), a for loop can be used to print the lines. In other words, a
> long-lived HTTP connection is used here to stream an endless log.
> Only *self.read_pod_logs* is covered by retry handling; if the network
> disconnects or jitters, the for loop itself will throw an error.
> As I have 200+ jobs running every day, I see 4~5 such errors per day, and
> each one makes the monitor think the pod is down, mark the task as failed,
> and retry it. This eventually leads to data errors.
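> To make the asymmetry concrete, here is a stripped-down illustration (not the
> real pod_launcher code, just the same shape, using the kubernetes client and
> tenacity directly):
> {code:python}
> import tenacity
> from kubernetes import client
>
> @tenacity.retry(stop=tenacity.stop_after_attempt(3), reraise=True)
> def read_pod_logs(core_v1: client.CoreV1Api, name: str, namespace: str):
>     # The retry only protects OPENING the streaming request...
>     return core_v1.read_namespaced_pod_log(
>         name=name,
>         namespace=namespace,
>         follow=True,
>         _preload_content=False,  # returns a urllib3 HTTPResponse to iterate
>     )
>
> def monitor_logs(core_v1: client.CoreV1Api, name: str, namespace: str):
>     logs = read_pod_logs(core_v1, name, namespace)
>     for line in logs:  # ...but iterating the stream is NOT retried or guarded:
>         print(line)    # a network hiccup here raises ProtocolError/IncompleteRead
> {code}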
>
> Below is a typical error log:
> {code}
> [2019-09-17 20:50:02,532] {logging_mixin.py:95} INFO - [2019-09-17
> 20:50:02,532] {pod_launcher.py:105} INFO - b'19/09/17 20:50:00 INFO
> yarn.Client: Application report for application_1565926539066_3866 (state:
> RUNNING)\n'
> [2019-09-17 20:50:02,532] {logging_mixin.py:95} INFO - [2019-09-17
> 20:50:02,532] {pod_launcher.py:105} INFO - b'19/09/17 20:50:01 INFO
> yarn.Client: Application report for application_1565926539066_3866 (state:
> RUNNING)\n'
> [2019-09-17 20:50:02,533] {taskinstance.py:1047} ERROR - ('Connection broken:
> IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
> Traceback (most recent call last):
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 639, in _update_chunk_length
> self.chunk_left = int(line, 16)
> ValueError: invalid literal for int() with base 16: b''
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 397, in _error_catcher
> yield
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 704, in read_chunked
> self._update_chunk_length()
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 643, in _update_chunk_length
> raise httplib.IncompleteRead(line)
> http.client.IncompleteRead: IncompleteRead(0 bytes read)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/models/taskinstance.py", line
> 922, in _run_raw_task
> result = task_copy.execute(context=context)
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/contrib/operators/k8s_pod_operator.py",
> line 45, in execute
> super().execute(context)
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/contrib/operators/kubernetes_pod_operator.py",
> line 148, in execute
> get_logs=self.get_logs)
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/contrib/kubernetes/pod_launcher.py",
> line 97, in run_pod
> return self._monitor_pod(pod, get_logs)
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/contrib/kubernetes/pod_launcher.py",
> line 104, in _monitor_pod
> for line in logs:
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 747, in __iter__
> for chunk in self.stream(decode_content=True):
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 527, in stream
> for line in self.read_chunked(amt, decode_content=decode_content):
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 732, in read_chunked
> self._original_response.close()
> File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
> self.gen.throw(type, value, traceback)
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 415, in _error_catcher
> raise ProtocolError('Connection broken: %r' % e, e)
> urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes
> read)', IncompleteRead(0 bytes read))
> [2019-09-17 20:50:02,536] {taskinstance.py:1070} INFO - Marking task as
> UP_FOR_RETRY
> [2019-09-17 20:50:02,717] {logging_mixin.py:95} INFO - [2019-09-17
> 20:50:02,717] {email.py:126} INFO - Sent an alert email to
> ['[email protected]', '[email protected]']
> [2019-09-17 20:50:03,921] {base_task_runner.py:115} INFO - Job 16307: Subtask
> ods_senseid_log_ingestion [2019-09-17 20:50:03,921]
> {cli_action_loggers.py:82} DEBUG - Calling callbacks: []
> [2019-09-17 20:50:03,925] {base_task_runner.py:115} INFO - Job 16307: Subtask
> ods_senseid_
> {code}
> I have added a try-catch around the log loop, and it works fine. Below is the
> trace from the exception catch (a simplified sketch of the guard itself
> follows after it):
> {code}
> ProtocolError('Connection broken: IncompleteRead(0 bytes read)',
> IncompleteRead(0 bytes
> read))ods-senseid-img-identity-liven-ods-senseid-img-identity-liven-83570f1fGetting
> log error, stop getting log.
> ProtocolError('Connection broken: IncompleteRead(0 bytes read)',
> IncompleteRead(0 bytes
> read))ods-senseid-idnumber-log-shang-ods-senseid-identity-idnumber--885013e7Getting
> log error, stop getting log.
> ProtocolError('Connection broken: IncompleteRead(0 bytes read)',
> IncompleteRead(0 bytes
> read))ods-senseid-idnumber-log-shang-ods-senseid-identity-idnumber--9627184eGetting
> log error, stop getting log.
> {code}
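> For reference, the guard itself is roughly the following simplified sketch
> (the decision to just stop tailing and let pod-status polling decide the task
> outcome is my local workaround, not necessarily the right upstream fix):
> {code:python}
> import logging
> from urllib3.exceptions import ProtocolError
>
> log = logging.getLogger(__name__)
>
> def tail_logs(logs):
>     """Print log lines, but never let a broken log stream fail the task."""
>     try:
>         for line in logs:
>             log.info(line)
>     except ProtocolError as exc:
>         # Stop tailing; the pod may still be running fine.
>         log.warning("Getting log error, stop getting log: %s", exc)
> {code}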
--
This message was sent by Atlassian Jira
(v8.3.4#803005)