[ 
https://issues.apache.org/jira/browse/AIRFLOW-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255292#comment-17255292
 ] 

lidor ettinger commented on AIRFLOW-4526:
-----------------------------------------

We encountered the same issue.

One approach we took was to send the logs to S3.

[remote-base-log-folder|https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#remote-base-log-folder]

> KubernetesPodOperator gets stuck in Running state when get_logs is set to 
> True and there is a long gap without logs from pod
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4526
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4526
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: operators
>         Environment: Azure Kubernetes Service cluster with Airflow based on 
> puckel/docker-airflow
>            Reporter: Christian Lellmann
>            Priority: Major
>              Labels: kubernetes
>             Fix For: 2.0.0
>
>
> When setting the `get_logs` parameter in the KubernetesPodOperator to True, 
> the operator task gets stuck in the Running state if the pod that is run by 
> the task (in_cluster mode) writes some logs and then stops writing logs for a 
> longer time (a few minutes) before continuing to write. The later log output 
> is no longer fetched and the pod state is no longer checked. So, the 
> completion of the pod isn't recognized and the task never finishes.
>  
> Assumption:
> In the `monitor_pod` method of the pod launcher 
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L97])
>  the `read_namespaced_pod_log` method of the kubernetes client gets stuck in 
> the `Follow=True` stream 
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L108])
>  because, if there is a period without logs from the pod, the method probably 
> doesn't forward the subsequent logs anymore.
> So, the `pod_launcher` no longer checks the pod state afterwards 
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L118])
>  and doesn't recognize the completed state -> the task stays stuck in Running.
> When disabling the `get_logs` parameter, everything works because the log 
> stream is skipped.
>  
> Suggestion:
> Poll the logs actively, without the `Follow` parameter set to True, in parallel 
> with the pod state checking.
> That way, the logs can be fetched without the described connection problem 
> while the pod state is checked at the same time, so the end states of the 
> pods are definitely recognized.
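
The suggestion quoted above could look roughly like the following Python sketch. It is not the Airflow implementation. `poll_pod`, `read_phase` and `read_log_lines` are hypothetical names, and the two callables are assumed to wrap the kubernetes client's `read_namespaced_pod` (returning `status.phase`) and `read_namespaced_pod_log` called with `follow=False`. Re-reading the whole log on each poll is also an assumption, made only to keep the sketch short.

```python
import time
from typing import Callable, List

# Pod phases after which the pod will not run any further.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def poll_pod(
    read_phase: Callable[[], str],
    read_log_lines: Callable[[], List[str]],
    emit: Callable[[str], None] = print,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll pod phase and logs in one loop instead of following a log stream.

    read_phase and read_log_lines are hypothetical wrappers (assumptions):
    read_phase returns the current pod phase, read_log_lines returns every
    log line written so far, fetched without follow mode.
    """
    emitted = 0  # number of log lines already passed to emit
    while True:
        # Read the phase before the logs: if the pod has already finished,
        # the log read that follows still picks up its last lines.
        phase = read_phase()
        lines = read_log_lines()
        for line in lines[emitted:]:
            emit(line)
        emitted = max(emitted, len(lines))
        if phase in TERMINAL_PHASES:
            return phase
        # A gap without new log lines only means an idle iteration here;
        # the phase is still checked on the next pass.
        sleep(poll_interval)
```

Because the phase check no longer depends on the log stream returning, a silent pod cannot block the loop. In a real setup the full-log re-read would probably be replaced by something cheaper, for example the `since_seconds` argument of `read_namespaced_pod_log`, and log rotation would need handling.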



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
