[ 
https://issues.apache.org/jira/browse/AIRFLOW-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310491#comment-17310491
 ] 

Anish Patel edited comment on AIRFLOW-6810 at 3/29/21, 8:38 AM:
----------------------------------------------------------------

Was this issue ever resolved? We are facing a similar (or maybe even the same) 
issue where the worker pod created by the Kube operator remains in the 'RUNNING' 
state for a long period of time (infinite loop) after the task script finishes 
execution. We had to kill the worker pod (using kubectl), which unblocked the 
task and let the DAG proceed to the next step. We are on version 1.10.6 and 
would like to know whether a fix is available in a later version.
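
For anyone trying to spot affected pods before killing them by hand, here is a 
minimal sketch of the condition visible in the pod details quoted below: the 
main container has terminated while the sidecar is still running. It is not 
part of Airflow. The function name `stuck_sidecar` is hypothetical, and the 
container names `base` and `airflow-xcom-sidecar` are assumptions taken from 
the `kubectl describe` output in this issue. The input is assumed to be the 
pod object as printed by `kubectl get pod <name> -o json`.

```python
import json


def stuck_sidecar(pod, main="base", sidecar="airflow-xcom-sidecar"):
    """Return True if the main container has terminated while the sidecar still runs.

    `pod` is a dict parsed from `kubectl get pod <name> -o json`.
    The default container names are assumptions based on the output in this issue.
    """
    # Map each container name to its state dict; the state has exactly one of
    # the keys "waiting", "running", or "terminated".
    states = {
        status["name"]: status.get("state", {})
        for status in pod.get("status", {}).get("containerStatuses", [])
    }
    main_done = "terminated" in states.get(main, {})
    sidecar_running = "running" in states.get(sidecar, {})
    return main_done and sidecar_running


# Trimmed-down pod object mirroring the state reported in this issue.
sample_pod = json.loads("""
{
  "status": {
    "phase": "Running",
    "containerStatuses": [
      {"name": "base",
       "state": {"terminated": {"reason": "Completed", "exitCode": 0}}},
      {"name": "airflow-xcom-sidecar",
       "state": {"running": {"startedAt": "2020-02-05T11:12:40Z"}}}
    ]
  }
}
""")

print(stuck_sidecar(sample_pod))
```

In practice the dict could come from `json.load(sys.stdin)` with the output of 
`kubectl get pod my_pod -o json` piped in. This only detects the symptom; it 
does not explain why the sidecar never exits.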


was (Author: anishpatel14):
Was this issue ever resolved? We are facing a similar (or maybe even the same) 
issue where the worker pod created by the Kube operator remains in the 'RUNNING' 
state for a long period of time (infinite loop) after the task script finishes 
execution. We had to kill the worker pod in production (using kubectl), which 
unblocked the task and let the DAG proceed to the next step. We are on 
version 1.10.6 and would like to know whether a fix is available in a 
later version.

> KubernetesPodOperator pod is completed but xcom side car is stuck
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-6810
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6810
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor-kubernetes
>    Affects Versions: 1.10.6
>            Reporter: Maxence Cramet
>            Assignee: Daniel Imberman
>            Priority: Major
>
> We're using KubernetesPodOperator with param xcom_push=true in order to push 
> information from our task.
> From time to time the main container completes but the sidecar container is stuck.
> Here's the output of the pods details:
> {noformat}
> kubectl describe pod my_pod
> Name:               my_pod
> Namespace:          default
> Priority:           0
> PriorityClassName:  <none>
> Node:               xxx
> Start Time:         Wed, 05 Feb 2020 11:12:33 +0000
> Labels:             xxx
> Annotations:        xxx
> Status:             Running
> IP:                 xxx
> Containers:
>   base:
>     Container ID:  xxx
>     Image:         xxx
>     Image ID:      xxx
>     Port:          <none>
>     Host Port:     <none>
>     Args:
>       xxx
>     State:          Terminated
>       Reason:       Completed
>       Exit Code:    0
>       Started:      Wed, 05 Feb 2020 11:12:38 +0000
>       Finished:     Wed, 05 Feb 2020 11:12:47 +0000
>     Ready:          False
>     Restart Count:  0
>     Limits:
>       memory:  512Mi
>     Requests:
>       memory:  512Mi
>     Environment:
>       xxx
>     Mounts:
>       /airflow/xcom from xcom (rw)
>   airflow-xcom-sidecar:
>     Container ID:  docker://83053d7d292cda9156454ac13064d72ace1e4f72738ba9b62b04ff57cb7966cc
>     Image:         alpine
>     Image ID:      docker-pullable://alpine@sha256:ab00606a42621fb68f2ed6ad3c88be54397f981a7b70a79db3d1172b11c4367d
>     Port:          <none>
>     Host Port:     <none>
>     Command:
>       sh
>       -c
>       trap "exit 0" INT; while true; do sleep 30; done;
>     State:          Running
>       Started:      Wed, 05 Feb 2020 11:12:40 +0000
>     Ready:          True
>     Restart Count:  0
>     Limits:
>       memory:  4Gi
>     Requests:
>       cpu:        1m
>       memory:     2Gi
>     Environment:  <none>
>     Mounts:
>       /airflow/xcom from xcom (rw)
>       xxx
> Conditions:
>   Type              Status
>   Initialized       True 
>   Ready             False 
>   ContainersReady   False 
>   PodScheduled      True 
> Volumes:
>   xcom:
>     Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
>     Medium:     
>     SizeLimit:  <unset>
>   xxx
> QoS Class:       Burstable
> Node-Selectors:  <none>
> Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
>                  node.kubernetes.io/unreachable:NoExecute for 300s
> Events:          <none>
> {noformat}
> I don't have any more information about the possible causes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
