[
https://issues.apache.org/jira/browse/AIRFLOW-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023320#comment-17023320
]
William Lachance commented on AIRFLOW-5571:
-------------------------------------------
I am 99% certain this is a duplicate of AIRFLOW-3534, but IMO the bug is a
little more clearly described there (I think the problem is really that we're
not timing out correctly for this case, which I'm about to file a PR for). I'd
personally recommend closing this one.
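For reference, a rough sketch of the direction I mean by "timing out correctly"
(the parameter choices below are illustrative only, not the actual PR): give the
streaming log request a read timeout so a dead connection raises promptly and
can be handled, instead of hanging or surfacing as a task failure.
{code:python}
# Illustrative only -- not the actual patch. Pod name, namespace and timeout
# values are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
core_v1 = client.CoreV1Api()

logs = core_v1.read_namespaced_pod_log(
    name="example-pod",          # placeholder pod name
    namespace="default",
    follow=True,
    _preload_content=False,      # stream the response instead of buffering it
    _request_timeout=(10, 60),   # (connect, read) timeouts in seconds
)
{code}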
> Kubernetes operator bug: getting logs can make the pod exit unexpectedly
> ------------------------------------------------------------------------
>
> Key: AIRFLOW-5571
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5571
> Project: Apache Airflow
> Issue Type: Bug
> Components: operators
> Affects Versions: 1.10.0, 1.10.1, 1.10.2, 1.10.3
> Reporter: Liu Xuesi
> Priority: Critical
> Labels: kubernetes, operator
> Attachments: k8s_error_log.png
>
>
> In my project, I manage 200+ jobs in Kubernetes, and while working on them I
> found a critical bug in the Kubernetes operator.
> In pod_launcher.py, in the *_monitor_pod* function:
> {code:python}
> if get_logs:
>     logs = self.read_pod_logs(pod)  # This call is wrapped in a retry.
>     for line in logs:               # But the exception is thrown from here!
>         self.log.info(line)
> {code}
> In the code above, *logs* is an HttpResponse; because it implements
> __iter__(), a for loop can be used to print the lines. In other words, a
> long-lived HTTP connection is used here to stream an endless log.
> Only *self.read_pod_logs* is covered by retry handling; if the network
> disconnects or jitters, the for loop itself will throw an error.
> As I have 200+ jobs running every day, I see 4~5 such errors per day, and
> each one makes the monitor think the pod is down, mark the task as failed,
> and retry it. This eventually leads to data errors.
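> To make the asymmetry concrete, here is a stripped-down illustration (not the
> real pod_launcher code, just the same shape, using the kubernetes client and
> tenacity directly):
> {code:python}
> import tenacity
> from kubernetes import client
>
> @tenacity.retry(stop=tenacity.stop_after_attempt(3), reraise=True)
> def read_pod_logs(core_v1: client.CoreV1Api, name: str, namespace: str):
>     # The retry only protects OPENING the streaming request...
>     return core_v1.read_namespaced_pod_log(
>         name=name,
>         namespace=namespace,
>         follow=True,
>         _preload_content=False,  # returns a urllib3 HTTPResponse to iterate
>     )
>
> def monitor_logs(core_v1: client.CoreV1Api, name: str, namespace: str):
>     logs = read_pod_logs(core_v1, name, namespace)
>     for line in logs:  # ...but iterating the stream is NOT retried or guarded:
>         print(line)    # a network hiccup here raises ProtocolError/IncompleteRead
> {code}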
>
> Below is a typical error log:
> {code}
> [2019-09-17 20:50:02,532] {logging_mixin.py:95} INFO - [2019-09-17
> 20:50:02,532] {pod_launcher.py:105} INFO - b'19/09/17 20:50:00 INFO
> yarn.Client: Application report for application_1565926539066_3866 (state:
> RUNNING)\n'
> [2019-09-17 20:50:02,532] {logging_mixin.py:95} INFO - [2019-09-17
> 20:50:02,532] {pod_launcher.py:105} INFO - b'19/09/17 20:50:01 INFO
> yarn.Client: Application report for application_1565926539066_3866 (state:
> RUNNING)\n'
> [2019-09-17 20:50:02,533] {taskinstance.py:1047} ERROR - ('Connection broken:
> IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
> Traceback (most recent call last):
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 639, in _update_chunk_length
> self.chunk_left = int(line, 16)
> ValueError: invalid literal for int() with base 16: b''
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 397, in _error_catcher
> yield
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 704, in read_chunked
> self._update_chunk_length()
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 643, in _update_chunk_length
> raise httplib.IncompleteRead(line)
> http.client.IncompleteRead: IncompleteRead(0 bytes read)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/models/taskinstance.py", line
> 922, in _run_raw_task
> result = task_copy.execute(context=context)
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/contrib/operators/k8s_pod_operator.py",
> line 45, in execute
> super().execute(context)
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/contrib/operators/kubernetes_pod_operator.py",
> line 148, in execute
> get_logs=self.get_logs)
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/contrib/kubernetes/pod_launcher.py",
> line 97, in run_pod
> return self._monitor_pod(pod, get_logs)
> File
> "/usr/local/lib/python3.6/dist-packages/airflow/contrib/kubernetes/pod_launcher.py",
> line 104, in _monitor_pod
> for line in logs:
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 747, in __iter__
> for chunk in self.stream(decode_content=True):
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 527, in stream
> for line in self.read_chunked(amt, decode_content=decode_content):
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 732, in read_chunked
> self._original_response.close()
> File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
> self.gen.throw(type, value, traceback)
> File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line
> 415, in _error_catcher
> raise ProtocolError('Connection broken: %r' % e, e)
> urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes
> read)', IncompleteRead(0 bytes read))
> [2019-09-17 20:50:02,536] {taskinstance.py:1070} INFO - Marking task as
> UP_FOR_RETRY
> [2019-09-17 20:50:02,717] {logging_mixin.py:95} INFO - [2019-09-17
> 20:50:02,717] {email.py:126} INFO - Sent an alert email to
> ['[email protected]', '[email protected]']
> [2019-09-17 20:50:03,921] {base_task_runner.py:115} INFO - Job 16307: Subtask
> ods_senseid_log_ingestion [2019-09-17 20:50:03,921]
> {cli_action_loggers.py:82} DEBUG - Calling callbacks: []
> [2019-09-17 20:50:03,925] {base_task_runner.py:115} INFO - Job 16307: Subtask
> ods_senseid_
> {code}
> I have added a try-catch around the log loop, and it works fine. Below is the
> trace from the exception catch (a simplified sketch of the guard itself
> follows after it):
> {code}
> ProtocolError('Connection broken: IncompleteRead(0 bytes read)',
> IncompleteRead(0 bytes
> read))ods-senseid-img-identity-liven-ods-senseid-img-identity-liven-83570f1fGetting
> log error, stop getting log.
> ProtocolError('Connection broken: IncompleteRead(0 bytes read)',
> IncompleteRead(0 bytes
> read))ods-senseid-idnumber-log-shang-ods-senseid-identity-idnumber--885013e7Getting
> log error, stop getting log.
> ProtocolError('Connection broken: IncompleteRead(0 bytes read)',
> IncompleteRead(0 bytes
> read))ods-senseid-idnumber-log-shang-ods-senseid-identity-idnumber--9627184eGetting
> log error, stop getting log.
> {code}
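> For reference, the guard itself is roughly the following simplified sketch
> (the decision to just stop tailing and let pod-status polling decide the task
> outcome is my local workaround, not necessarily the right upstream fix):
> {code:python}
> import logging
> from urllib3.exceptions import ProtocolError
>
> log = logging.getLogger(__name__)
>
> def tail_logs(logs):
>     """Print log lines, but never let a broken log stream fail the task."""
>     try:
>         for line in logs:
>             log.info(line)
>     except ProtocolError as exc:
>         # Stop tailing; the pod may still be running fine.
>         log.warning("Getting log error, stop getting log: %s", exc)
> {code}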
--
This message was sent by Atlassian Jira
(v8.3.4#803005)