[
https://issues.apache.org/jira/browse/AIRFLOW-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867215#comment-16867215
]
Christian Lellmann commented on AIRFLOW-4526:
---------------------------------------------
Hi [~ash], unfortunately it isn't that easy to provide a minimal reproduction
DAG with all the additional components. Here is exactly what I did to trigger
this bug:
* Run the puckel docker image 1.10.2 in an Azure Kubernetes Service cluster in
Celery mode, deployed with version 2.4.4 of the stable Helm chart
([https://github.com/helm/charts/tree/master/stable/airflow])
* Create a DAG with a KubernetesPodOperator in in_cluster mode and with the
get_logs parameter set to True
* With this operator, run a docker container that first logs some output and
then spends several minutes in a long-running step without writing any logs
before logging again
* The KubernetesPodOperator then stops receiving logs (the task no longer logs
anything) and gets stuck
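For illustration, a minimal DAG along these lines might look as follows. This is a hypothetical sketch based on the steps above, not my actual DAG; the dag_id, image, and command are illustrative stand-ins:

```python
# Hypothetical minimal repro: a pod that logs, goes silent for several
# minutes, then logs again and exits. The silent gap triggers the hang
# when get_logs=True.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(dag_id="kpo_log_gap_repro",
         start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    repro = KubernetesPodOperator(
        task_id="log_gap_pod",
        name="log-gap-pod",
        namespace="default",
        image="busybox",
        cmds=["sh", "-c"],
        arguments=["echo start; sleep 600; echo done"],
        in_cluster=True,
        get_logs=True,  # with get_logs=False the task completes normally
    )
```

With this DAG, the task should stay in Running even after the pod has exited.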
I hope this helps a bit. I'm sorry that I cannot provide a full example code.
> KubernetesPodOperator gets stuck in Running state when get_logs is set to
> True and there is a long gap without logs from pod
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: AIRFLOW-4526
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4526
> Project: Apache Airflow
> Issue Type: Bug
> Components: operators
> Environment: Azure Kubernetes Service cluster with Airflow based on
> puckel/docker-airflow
> Reporter: Christian Lellmann
> Priority: Major
> Labels: kubernetes
> Fix For: 1.10.4
>
>
> When setting the `get_logs` parameter of the KubernetesPodOperator to True,
> the operator task gets stuck in the Running state if the pod run by the task
> (in_cluster mode) writes some logs and then stops logging for a longer time
> (a few minutes) before continuing. The subsequent logs are no longer fetched
> and the pod state is no longer checked. So the completion of the pod isn't
> recognized and the task never finishes.
>
> Assumption:
> In the `monitor_pod` method of the pod launcher
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L97])
> the `read_namespaced_pod_log` method of the Kubernetes client gets stuck in
> the `follow=True` stream
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L108]),
> probably because after a period without logs from the pod the method no
> longer forwards the subsequent logs.
> As a result, the `pod_launcher` never reaches the later pod state check
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L118])
> and never sees the completed state -> the task stays in Running.
> When the `get_logs` parameter is disabled, everything works because the log
> stream is skipped.
>
> Suggestion:
> Poll the logs actively, without the `follow` parameter set to True, in
> parallel with the pod state checks.
> This makes it possible to fetch the logs without the described connection
> problem while concurrently checking the pod state, so the end states of the
> pod are reliably recognized.
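A minimal sketch of the suggested polling approach. The helper and its parameters are hypothetical, not Airflow's actual code: in a real implementation, `fetch_logs` would wrap the Kubernetes client's `read_namespaced_pod_log` (without `follow=True`, e.g. using `since_seconds` to fetch only new output) and `get_phase` would wrap `read_namespaced_pod`:

```python
import time


def monitor_by_polling(fetch_logs, get_phase, poll_interval=1.0, sleep=time.sleep):
    """Alternate bounded log polls with pod state checks.

    fetch_logs() -> str: a chunk of new logs (may be empty); returns promptly
                         instead of blocking on a follow stream.
    get_phase()  -> str: the pod's current phase.
    """
    collected = []
    while True:
        chunk = fetch_logs()  # bounded call: a quiet pod cannot block here
        if chunk:
            collected.append(chunk)
        # The state check runs on every iteration, so a long gap in the
        # pod's logs can never starve completion detection.
        phase = get_phase()
        if phase in ("Succeeded", "Failed"):
            return phase, "".join(collected)
        sleep(poll_interval)
```

Because each log fetch returns immediately, the loop always comes back around to the phase check, so terminal pod states are detected even when the pod is silent for minutes.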
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)