[
https://issues.apache.org/jira/browse/AIRFLOW-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17255292#comment-17255292
]
lidor ettinger edited comment on AIRFLOW-4526 at 12/27/20, 5:54 PM:
--------------------------------------------------------------------
We encountered the same issue.
One of the approaches we took was to send logs to S3.
[remote-base-log-folder|https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#remote-base-log-folder]
was (Author: lidor.ettinger):
We encountered the same issue.
One of the approach we did was to send logs to s3.
[remote-base-log-folder|https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#remote-base-log-folder]
> KubernetesPodOperator gets stuck in Running state when get_logs is set to
> True and there is a long gap without logs from pod
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: AIRFLOW-4526
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4526
> Project: Apache Airflow
> Issue Type: Bug
> Components: operators
> Environment: Azure Kubernetes Service cluster with Airflow based on
> puckel/docker-airflow
> Reporter: Christian Lellmann
> Priority: Major
> Labels: kubernetes
> Fix For: 2.0.0
>
>
> When the `get_logs` parameter of the KubernetesPodOperator is set to True,
> the operator task gets stuck in the Running state if the pod run by the
> task (in_cluster mode) writes some logs and then stops writing logs for a
> longer time (a few minutes) before it continues writing. The later log
> output is no longer fetched and the pod state is no longer checked. So the
> completion of the pod isn't recognized and the task never finishes.
>
> Assumption:
> In the `monitor_pod` method of the pod launcher
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L97])
> the `read_namespaced_pod_log` method of the Kubernetes client probably gets
> stuck in the `Follow=True` stream
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L108])
> because, after a period without logs from the pod, the method no longer
> forwards the logs that follow.
> So the `pod_launcher` never reaches the later pod state check
> ([https://github.com/apache/airflow/blob/master/airflow/kubernetes/pod_launcher.py#L118])
> and doesn't recognize the completed state, so the task stays in Running.
> When the `get_logs` parameter is disabled, everything works because the log
> stream is skipped.
>
> Suggestion:
> Poll the logs actively, without the `Follow` parameter set to True, in
> parallel with the pod state checking.
> That makes it possible to fetch the logs without the connection problem
> described above while concurrently checking the pod state, so the end
> states of the pods are definitely recognized.
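
The suggestion in the quoted issue could look roughly like the sketch below. It is only an illustration, not the fix that went into Airflow: `monitor_pod_by_polling` and `make_kubernetes_fetchers` are hypothetical names, and the sketch assumes it is acceptable to re-read the whole pod log on each poll and to skip lines by count. The polling loop takes plain callables, so it runs without a cluster; the second function shows one way it might be wired to the Kubernetes Python client with `follow=False`.

```python
import time
from typing import Callable, Iterable

# Pod phases after which no further state change is expected.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def monitor_pod_by_polling(
    fetch_log_lines: Callable[[], Iterable[str]],
    fetch_phase: Callable[[], str],
    emit: Callable[[str], None] = print,
    poll_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll logs and pod phase in one loop instead of following a log stream.

    fetch_log_lines is assumed to return the complete log so far on each call.
    Returns the terminal phase that ended the loop.
    """
    seen = 0  # number of log lines already emitted
    while True:
        # Read the phase first: if it is terminal, the log read that follows
        # already contains the pod's final output.
        phase = fetch_phase()
        lines = list(fetch_log_lines())
        for line in lines[seen:]:
            emit(line)
        seen = max(seen, len(lines))
        if phase in TERMINAL_PHASES:
            return phase
        sleep(poll_interval)


def make_kubernetes_fetchers(name: str, namespace: str):
    """Hypothetical wiring to the Kubernetes Python client (in-cluster)."""
    # Imported lazily so the polling loop above works without the package.
    from kubernetes import client, config

    config.load_incluster_config()
    api = client.CoreV1Api()

    def fetch_log_lines():
        # follow=False returns the current log as a string and then returns,
        # so a quiet pod cannot block the loop.
        text = api.read_namespaced_pod_log(
            name=name, namespace=namespace, follow=False
        )
        return text.splitlines()

    def fetch_phase():
        return api.read_namespaced_pod(name=name, namespace=namespace).status.phase

    return fetch_log_lines, fetch_phase
```

Because every log read returns instead of blocking, the pod phase is checked on each iteration no matter how long the pod stays silent. For long-running pods with large logs, re-reading the whole log each time would be wasteful; limiting each read with `since_seconds` or `tail_lines` is one possible refinement, at the cost of having to handle overlapping lines.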
--
This message was sent by Atlassian Jira
(v8.3.4#803005)