Liu Xuesi created AIRFLOW-5571:
----------------------------------

             Summary: Kubernetes operator's bug that get logs will make pod 
exit unexpectedly
                 Key: AIRFLOW-5571
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5571
             Project: Apache Airflow
          Issue Type: Bug
          Components: operators
    Affects Versions: 1.10.3, 1.10.2, 1.10.1, 1.10.0
            Reporter: Liu Xuesi
         Attachments: k8s_error_log.png

In my project, I manage 200+ jobs in kubernetes, as I work on, I found a 
critical bug for kubernetes operator.

In pod_launcher.py, *_monitor_pod* function:
{code:python}
        if get_logs:
            logs = self.read_pod_logs(pod) # Here has a retry.
            for line in logs:  # But exception throw from here!
                self.
{code}
in above code, *logs* is a HttpResponse, as it implemented __iter__() function, 
you can use for loop to print the lines. In the other words, here use a http 
long connection to get endless log.

There is only try catch over *self.read_pod_logs*, however, If the network is 
disconnected or jitter occurs, for loop will throw error.

As I have 200+ job run everyday, I can get 4~5 errors everyday, and each error 
will let monitor think the pod is down, and then mark the task as failed, then 
retry it. This eventually lead to data error.

 

Below is a typical error log:
{code:java}
 [2019-09-17 20:50:02,532] {logging_mixin.py:95} INFO - [2019-09-17 
20:50:02,532] {pod_launcher.py:105} INFO - b'19/09/17 20:50:00 INFO 
yarn.Client: Application report for application_1565926539066_3866 (state: 
RUNNING)\n'
[2019-09-17 20:50:02,532] {logging_mixin.py:95} INFO - [2019-09-17 
20:50:02,532] {pod_launcher.py:105} INFO - b'19/09/17 20:50:01 INFO 
yarn.Client: Application report for application_1565926539066_3866 (state: 
RUNNING)\n'
[2019-09-17 20:50:02,533] {taskinstance.py:1047} ERROR - ('Connection broken: 
IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 639, 
in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 397, 
in _error_catcher
    yield
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 704, 
in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 643, 
in _update_chunk_length
    raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/airflow/models/taskinstance.py", 
line 922, in _run_raw_task
    result = task_copy.execute(context=context)
  File 
"/usr/local/lib/python3.6/dist-packages/airflow/contrib/operators/k8s_pod_operator.py",
 line 45, in execute
    super().execute(context)
  File 
"/usr/local/lib/python3.6/dist-packages/airflow/contrib/operators/kubernetes_pod_operator.py",
 line 148, in execute
    get_logs=self.get_logs)
  File 
"/usr/local/lib/python3.6/dist-packages/airflow/contrib/kubernetes/pod_launcher.py",
 line 97, in run_pod
    return self._monitor_pod(pod, get_logs)
  File 
"/usr/local/lib/python3.6/dist-packages/airflow/contrib/kubernetes/pod_launcher.py",
 line 104, in _monitor_pod
    for line in logs:
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 747, 
in __iter__
    for chunk in self.stream(decode_content=True):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 527, 
in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 732, 
in read_chunked
    self._original_response.close()
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/response.py", line 415, 
in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes 
read)', IncompleteRead(0 bytes read))
[2019-09-17 20:50:02,536] {taskinstance.py:1070} INFO - Marking task as 
UP_FOR_RETRY
[2019-09-17 20:50:02,717] {logging_mixin.py:95} INFO - [2019-09-17 
20:50:02,717] {email.py:126} INFO - Sent an alert email to 
['[email protected]', '[email protected]']
[2019-09-17 20:50:03,921] {base_task_runner.py:115} INFO - Job 16307: Subtask 
ods_senseid_log_ingestion [2019-09-17 20:50:03,921] {cli_action_loggers.py:82} 
DEBUG - Calling callbacks: []
[2019-09-17 20:50:03,925] {base_task_runner.py:115} INFO - Job 16307: Subtask 
ods_senseid_
{code}

I have add a try-catch over for log loop, and it works fine. Below is trace for 
exception catch:

{code}
ProtocolError('Connection broken: IncompleteRead(0 bytes read)', 
IncompleteRead(0 bytes 
read))ods-senseid-img-identity-liven-ods-senseid-img-identity-liven-83570f1fGetting
 log error, stop getting log.
ProtocolError('Connection broken: IncompleteRead(0 bytes read)', 
IncompleteRead(0 bytes 
read))ods-senseid-idnumber-log-shang-ods-senseid-identity-idnumber--885013e7Getting
 log error, stop getting log.
ProtocolError('Connection broken: IncompleteRead(0 bytes read)', 
IncompleteRead(0 bytes 
read))ods-senseid-idnumber-log-shang-ods-senseid-identity-idnumber--9627184eGetting
 log error, stop getting log.
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to