Stefan Seelmann created AIRFLOW-4797:
----------------------------------------
Summary: Zombie detection and killing is not deterministic
Key: AIRFLOW-4797
URL: https://issues.apache.org/jira/browse/AIRFLOW-4797
Project: Apache Airflow
Issue Type: Bug
Components: scheduler
Affects Versions: 1.10.3
Reporter: Stefan Seelmann
Assignee: Stefan Seelmann
Zombie detection and killing is done within the DAG file processing loop.
Within one iteration only a subset of the DAG files is processed (limited by
the config option scheduler.max_threads). The loop then sleeps for the rest of
the second until the next iteration runs, which processes the next subset of
DAG files. The function that gets zombie task instances returns zombies at most
once every 10 seconds; otherwise it returns an empty list.
That means zombies are detected only in every 10th iteration of the DAG file
processing loop, and they are killed only if the zombie task belongs to one of
the DAG files processed in that iteration.
We ran into a very unfortunate scenario with max_threads=2 and 20 DAGs.
Assuming one DAG per file, a full pass over all files takes 10 iterations,
which coincides with the 10-second zombie query interval. In such a scenario
only zombies of the same 2 DAGs are killed. (As loop iterations do not take
exactly 1s, the alignment slowly shifts and the zombies are eventually killed,
but in one example it took 33 minutes.)
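
The interaction described above can be reproduced with a small toy model. This
is only a sketch and not Airflow code: the function simulate and all of its
parameters are hypothetical names, and it assumes that DAG files are processed
round-robin in a fixed order and that the zombie query is throttled by a simple
elapsed-time check. Times are kept in integer milliseconds to avoid
floating-point drift.

```python
def simulate(num_dag_files=20, max_threads=2, iteration_ms=1000,
             zombie_interval_ms=10_000, iterations=600):
    """Toy model of the processing loop (hypothetical, not Airflow code).

    Returns the set of DAG file indexes whose iteration coincided with a
    non-empty zombie query, i.e. the files whose zombies could be killed.
    """
    checked = set()
    last_query_ms = None  # time of the last zombie query that returned data
    now_ms = 0
    next_file = 0

    for _ in range(iterations):
        # Files handled in this iteration: the next max_threads files,
        # wrapping around at the end of the file list.
        batch = [(next_file + i) % num_dag_files for i in range(max_threads)]
        next_file = (next_file + max_threads) % num_dag_files

        # Throttled zombie query: zombies are returned at most once per
        # interval; in all other iterations the list is empty.
        if last_query_ms is None or now_ms - last_query_ms >= zombie_interval_ms:
            last_query_ms = now_ms
            checked.update(batch)

        now_ms += iteration_ms

    return checked


if __name__ == "__main__":
    print(sorted(simulate()))                   # iterations of exactly 1s
    print(sorted(simulate(iteration_ms=1200)))  # iterations of 1.2s
```

In this model, iterations of exactly 1s let only files 0 and 1 ever coincide
with a zombie query, while 1.2s iterations shift the alignment so that every
file is eventually covered. Decoupling the zombie query from the per-file
batches would remove the dependency on this alignment.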