Daniel Cooper created AIRFLOW-5589:
--------------------------------------

             Summary: KubernetesPodOperator: Duplicate pods created on worker 
restart
                 Key: AIRFLOW-5589
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5589
             Project: Apache Airflow
          Issue Type: Bug
          Components: worker
    Affects Versions: 1.10.5, 1.10.4
            Reporter: Daniel Cooper
            Assignee: Daniel Cooper


KubernetesPodOperator holds state within its execute function that it uses to 
monitor the running pod. If the worker restarts for any reason (pod death, pod 
shuffle, upgrade, etc.), that state is lost.

At this point the scheduler notices (after waiting out the max heartbeat 
interval) that the task is now a 'zombie' (no longer monitored) and 
reschedules it.

The new worker has no knowledge of the existing running pod and so creates a 
duplicate. In extreme cases this can lead to many duplicate pods for the same 
task running simultaneously.

I believe this is the problem Nicholas Brenwald (King) described when running 
the KubernetesPodOperator on Google Composer (at the September meetup at King).

My fix is to add enough labels to uniquely identify a running pod as belonging 
to a given task instance (dag_id, task_id, run_id). We then do a namespaced 
list of pods from k8s with a label selector; if a matching pod exists we 
monitor it, otherwise we create a new one as normal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
