[ https://issues.apache.org/jira/browse/AIRFLOW-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885213#comment-16885213 ]

ASF subversion and git services commented on AIRFLOW-4797:
----------------------------------------------------------

Commit 5842247c90ab7c96e47ccece891b5c3e65acd88c in airflow's branch 
refs/heads/v1-10-test from Stefan Seelmann
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=5842247 ]

[AIRFLOW-4797] Improve performance and behaviour of zombie detection (#5511)

Moved the query that fetches zombies from DagFileProcessorManager to the 
DagBag class. Changed the query to look only for DAGs in the current DAG bag. 
The query now uses the index ti_dag_state instead of ti_state. Removed the 
no-longer-required zombies parameters from many function signatures.

The query is now executed on every call to DagBag.kill_zombies, which is 
called whenever a DAG file is processed; how often that happens depends on 
scheduler_heartbeat_sec and processor_poll_interval (AFAIU). The query is 
faster than the previous one (see also stats below). Its cost is also 
negligible IMHO, because many other queries are executed during DAG file 
processing (DAG runs and task instances are created, task instance 
dependencies are checked).
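
The narrowed query described above can be illustrated with a minimal sketch. 
This is not the Airflow code: it uses an in-memory SQLite database with a 
heavily simplified, assumed schema (integer heartbeats, only a few columns), 
and find_zombies and demo_connection are hypothetical helper names. It only 
shows the idea of restricting the lookup to the DAG ids of the current DAG bag 
so that an index on (dag_id, state) can be used.

```python
import sqlite3


def demo_connection():
    """Build an in-memory database with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT,
                          latest_heartbeat INTEGER);
        CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, state TEXT,
                                    job_id INTEGER);
        -- Index names mirror those mentioned in the commit message.
        CREATE INDEX ti_dag_state ON task_instance (dag_id, state);
        CREATE INDEX ti_state ON task_instance (state);
        INSERT INTO job VALUES (1, 'running', 100), (2, 'running', 10),
                               (3, 'failed', 100);
        INSERT INTO task_instance VALUES
            ('dag_a', 't1', 'running', 1),
            ('dag_a', 't2', 'running', 2),
            ('dag_b', 't1', 'running', 3),
            ('dag_c', 't1', 'running', 2);
        """
    )
    return conn


def find_zombies(conn, dag_ids, heartbeat_limit):
    """Return (dag_id, task_id) for running task instances of the given DAGs
    whose job is no longer running or has a heartbeat older than the limit."""
    if not dag_ids:
        return []
    placeholders = ",".join("?" for _ in dag_ids)
    sql = (
        "SELECT ti.dag_id, ti.task_id FROM task_instance ti "
        "JOIN job j ON ti.job_id = j.id "
        "WHERE ti.state = 'running' "
        f"AND ti.dag_id IN ({placeholders}) "
        "AND (j.state != 'running' OR j.latest_heartbeat < ?) "
        "ORDER BY ti.dag_id, ti.task_id"
    )
    return conn.execute(sql, [*dag_ids, heartbeat_limit]).fetchall()


if __name__ == "__main__":
    conn = demo_connection()
    # Only dag_a and dag_b are in this (hypothetical) DAG bag.
    print(find_zombies(conn, ["dag_a", "dag_b"], heartbeat_limit=50))
```

In this toy data set the task instance in dag_c also has a stale job, but it 
is not returned because dag_c is not part of the DAG bag being processed.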

(cherry picked from commit 2bdb053db618de7064b527e6e3ebe29f220d857b)


> Zombie detection and killing is not deterministic
> -------------------------------------------------
>
>                 Key: AIRFLOW-4797
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4797
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.3
>            Reporter: Stefan Seelmann
>            Assignee: Stefan Seelmann
>            Priority: Major
>
> Zombie detection and killing is done within the DAG file processing loop. 
> Within one iteration only a subset of the DAG files is processed (config 
> scheduler.max_threads). The loop sleeps for the rest of the second until the 
> next iteration runs, which processes the next subset of DAG files. The 
> function to get zombie task instances returns zombies only once within 10 
> seconds; otherwise an empty list is returned.
> That means zombies are detected only in every 10th iteration of the DAG file 
> processing loop, and they are killed only if the zombie tasks belong to one 
> of the DAG files of the current iteration.
> We run into the worst-case scenario with max_threads=2 and 20 DAGs. In such a 
> scenario only zombies of the same 2 DAGs are killed. (As loop iterations are 
> not exactly 1s, the window shifts slowly and eventually the zombies are 
> killed, but in one example it took 33 minutes.)
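
The worst case described in the issue can be reproduced with a small 
simulation. This is a simplified model, not scheduler code: it assumes that 
every loop iteration takes exactly one second, that DAG files are processed in 
a fixed round-robin order, and that a non-empty zombie list is handed out only 
in every 10th iteration. The function name dags_checked_for_zombies and its 
parameters are hypothetical.

```python
def dags_checked_for_zombies(num_dags, max_threads, detect_every, iterations):
    """Return the indexes of DAG files that were processed in an iteration
    which actually received a non-empty zombie list."""
    checked = set()
    for i in range(iterations):
        # Round-robin: each iteration processes the next max_threads DAG files.
        start = (i * max_threads) % num_dags
        batch = {(start + k) % num_dags for k in range(max_threads)}
        # Zombies are only returned once per detect_every iterations.
        if i % detect_every == 0:
            checked |= batch
    return checked


if __name__ == "__main__":
    # 20 DAGs, max_threads=2, zombies returned in every 10th iteration.
    print(sorted(dags_checked_for_zombies(20, 2, 10, 2000)))  # [0, 1]
```

Under these assumptions a full pass over 20 DAG files takes exactly 10 
iterations, so the zombie list always arrives while the same two files are 
being processed. In the real scheduler the iterations are not exactly 1s long, 
which is why the window drifts and the remaining zombies are eventually killed.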



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
