[ 
https://issues.apache.org/jira/browse/AIRFLOW-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867215#comment-16867215
 ] 

Christian Lellmann commented on AIRFLOW-4526:
---------------------------------------------

Hi [~ash], unfortunately it isn't easy to provide a minimal reproduction DAG 
example with all the additional components involved. Here is exactly what I 
did to hit this bug:
 * Run the puckel docker image 1.10.2 in an Azure Kubernetes Service cluster 
in Celery mode, based on version 2.4.4 of the stable helm chart 
([https://github.com/helm/charts/tree/master/stable/airflow])
 * Create a DAG with a KubernetesPodOperator in in_cluster mode and with the 
get_logs parameter set to True
 * With this operator, run a docker container that first writes some logs and 
then enters a long-running step that takes several minutes before writing more 
logs
 * The KubernetesPodOperator then stops receiving any logs (the task no longer 
logs anything) and gets stuck

I hope this helps a bit. I'm sorry that I cannot provide a full code example.

> KubernetesPodOperator gets stuck in Running state when get_logs is set to 
> True and there is a long gap without logs from pod
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4526
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4526
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: operators
>         Environment: Azure Kubernetes Service cluster with Airflow based on 
> puckel/docker-airflow
>            Reporter: Christian Lellmann
>            Priority: Major
>              Labels: kubernetes
>             Fix For: 1.10.4
>
>
> When the `get_logs` parameter of the KubernetesPodOperator is set to True, 
> the operator's task gets stuck in the Running state if the pod run by the 
> task (in in_cluster mode) writes some logs, stops writing for a longer time 
> (a few minutes), and then continues. The later log output is never fetched 
> and the pod state is no longer checked, so the completion of the pod isn't 
> recognized and the task never finishes.
>  
> Assumption:
> In the `monitor_pod` method of the pod launcher 
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L97])
>  the `read_namespaced_pod_log` method of the kubernetes client appears to 
> get stuck in the `follow=True` stream 
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L108]):
>  after a period without logs from the pod, the stream stops delivering 
> subsequent log lines.
> As a result, the `pod_launcher` never reaches the later pod state check 
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L118])
>  and doesn't recognize the completed state, so the task stays in Running.
> Everything works when the `get_logs` parameter is disabled, because the log 
> stream is skipped entirely.
>  
> Suggestion:
> Poll the logs actively, without the `follow` parameter set to True, in 
> parallel with the pod state check.
> That way the logs can be fetched without the described connection problem 
> while the pod state is checked concurrently, so the end states of the pod 
> are always recognized.
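The suggested polling loop could be sketched as follows. This is only an 
illustration of the pattern, not Airflow code: `read_logs` and `pod_phase` 
are hypothetical stand-ins for the kubernetes client calls 
(`read_namespaced_pod_log` without `follow=True`, and a pod status read).

```python
import time

def monitor_pod(read_logs, pod_phase, poll_interval=5.0):
    """Alternate between fetching new logs and checking pod state,
    instead of blocking on a follow=True log stream that can hang
    during long gaps without output.

    read_logs(since) -> (new_lines, new_since) and pod_phase() are
    stubs for the kubernetes client calls (hypothetical signatures).
    """
    since = None
    while True:
        lines, since = read_logs(since)  # non-blocking fetch of new log lines
        for line in lines:
            print(line)
        phase = pod_phase()              # state check runs even when no logs arrive
        if phase in ("Succeeded", "Failed"):
            return phase
        time.sleep(poll_interval)
```

Because the state check runs on every iteration regardless of whether logs 
arrived, a terminal pod phase is always recognized.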



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
