Stefan Seelmann created AIRFLOW-4797:
----------------------------------------

             Summary: Zombie detection and killing is not deterministic
                 Key: AIRFLOW-4797
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4797
             Project: Apache Airflow
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 1.10.3
            Reporter: Stefan Seelmann
            Assignee: Stefan Seelmann


Zombie detection and killing is done within the DAG file processing loop.
Within one iteration only a subset of the DAG files is processed (limited by
the config scheduler.max_threads). The loop then sleeps for the rest of the
second, until the next iteration processes the next subset of DAG files. The
function that gets zombie task instances returns zombies at most once every
10 seconds; any other call returns an empty list.
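
To make the throttling concrete, here is a minimal sketch of a lookup that
returns results at most once per interval. The class and method names are
made up for illustration and are not the actual Airflow code; only the 10
second interval comes from the description above.

```python
from typing import Callable, List, Optional


class ThrottledZombieFinder:
    """Hypothetical stand-in for the scheduler's zombie lookup."""

    def __init__(self, query: Callable[[], List[str]], interval: float = 10.0):
        self._query = query        # performs the real lookup
        self._interval = interval  # minimum seconds between real lookups
        self._last_query_time: Optional[float] = None

    def get_zombies(self, now: float) -> List[str]:
        # Run the real lookup only if the interval has elapsed ...
        if (self._last_query_time is None
                or now - self._last_query_time >= self._interval):
            self._last_query_time = now
            return self._query()
        # ... every other call reports "no zombies", even if some exist.
        return []
```

Called once per second, such a finder hands out the zombie list to one caller
and an empty list to the next nine.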

That means zombies are detected only in every 10th iteration of the DAG file
processing loop, and they are killed only if the zombie task belongs to one of
the DAG files processed in that iteration.

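We ran into a very unfortunate scenario with max_threads=2 and 20 DAGs. With 2
files per iteration, a full pass over 20 DAG files takes 10 iterations, which
lines up with the 10 second zombie interval, so only zombies of the same 2 DAGs
are killed. (Because loop iterations do not take exactly 1s, the alignment
shifts slowly and eventually the zombies of the other DAGs are killed, but in
one example it took 33 minutes.)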
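
The scenario above can be reproduced with a small simulation. This is only a
sketch under simplifying assumptions: iterations take exactly
iteration_seconds, DAG files are processed round-robin, and all function and
parameter names are hypothetical rather than taken from the scheduler.

```python
from typing import Optional, Set


def dags_checked_for_zombies(
    num_dag_files: int,
    max_threads: int,
    iteration_seconds: float,
    zombie_query_interval: float,
    num_iterations: int,
) -> Set[int]:
    """Return the indices of DAG files that were processed in an
    iteration in which the zombie lookup actually returned zombies."""
    checked: Set[int] = set()
    last_query: Optional[float] = None  # time of the last real lookup
    next_file = 0
    for i in range(num_iterations):
        now = i * iteration_seconds
        # Round-robin: each iteration takes the next max_threads files.
        batch = [(next_file + k) % num_dag_files for k in range(max_threads)]
        next_file = (next_file + max_threads) % num_dag_files
        # The lookup is throttled; only some iterations see zombies.
        if last_query is None or now - last_query >= zombie_query_interval:
            last_query = now
            checked.update(batch)
    return checked


# 20 DAG files, max_threads=2, 1s iterations, 10s interval, 10 minutes:
print(sorted(dags_checked_for_zombies(20, 2, 1.0, 10.0, 600)))  # [0, 1]
```

The sketch deliberately ignores timing jitter, which is what eventually shifts
the alignment in the real scheduler.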




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
