[ 
https://issues.apache.org/jira/browse/AIRFLOW-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876365#comment-16876365
 ] 

ASF GitHub Bot commented on AIRFLOW-4797:
-----------------------------------------

seelmann commented on pull request #5511: [AIRFLOW-4797] Fix zombie detection
URL: https://github.com/apache/airflow/pull/5511
 
 
   ### Jira
   
   - [X] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title.
     - https://issues.apache.org/jira/browse/AIRFLOW-4797
   
   ### Description
   
   - [X] Here are some details about my PR, including screenshots of any UI 
changes:
   
   Moved the query that fetches zombies from `DagFileProcessorManager` to the 
`DagBag` class. The query now looks only for DAGs in the current DAG bag and 
uses index `ti_dag_state` instead of `ti_state`. Removed the no-longer-required 
`zombies` parameter from many function signatures.
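
The shape of such a DAG-scoped zombie query can be sketched as follows. This is a minimal illustration, not the code from the PR: it uses SQLite with a simplified, hypothetical schema (`task_instance`, `job`), a hypothetical helper `find_zombies`, and assumes a zombie is a running task instance whose job is no longer running or whose heartbeat is older than a given limit. The `ti_dag_state` index is assumed here to cover `(dag_id, state)`.

```python
import sqlite3


def find_zombies(conn, dag_ids, heartbeat_limit):
    """Return (dag_id, task_id) pairs of zombie task instances for the given DAGs.

    Hypothetical helper: the zombie criterion below is an assumption made
    for illustration, not the exact query used by Airflow.
    """
    if not dag_ids:
        return []
    placeholders = ",".join("?" for _ in dag_ids)
    sql = f"""
        SELECT ti.dag_id, ti.task_id
        FROM task_instance AS ti
        JOIN job ON ti.job_id = job.id
        WHERE ti.state = 'running'
          AND ti.dag_id IN ({placeholders})
          AND (job.state != 'running' OR job.latest_heartbeat < ?)
        ORDER BY ti.dag_id, ti.task_id
    """
    return conn.execute(sql, [*dag_ids, heartbeat_limit]).fetchall()


# Simplified stand-in schema (assumption), with an index on (dag_id, state).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT, latest_heartbeat REAL);
    CREATE TABLE task_instance (task_id TEXT, dag_id TEXT, state TEXT, job_id INTEGER);
    CREATE INDEX ti_dag_state ON task_instance (dag_id, state);
""")
conn.executemany("INSERT INTO job VALUES (?, ?, ?)", [
    (1, "running", 100.0),  # healthy job
    (2, "running", 10.0),   # heartbeat is stale
    (3, "failed", 100.0),   # job is no longer running
])
conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?)", [
    ("t1", "dag_a", "running", 1),
    ("t2", "dag_a", "running", 2),
    ("t3", "dag_b", "running", 3),
    ("t4", "dag_b", "success", 3),
])

# Only zombies of the DAGs in the "current DAG bag" are returned.
zombies_a = find_zombies(conn, ["dag_a"], heartbeat_limit=50.0)
zombies_all = find_zombies(conn, ["dag_a", "dag_b"], heartbeat_limit=50.0)
```

Filtering on `dag_id` and `state` together is what would let the database use the composite index rather than scanning every running task instance.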
    
   The query is now executed on every call to `DagBag.kill_zombies`, which is 
called whenever the DAG file is processed; how often that happens depends on 
`scheduler_heartbeat_sec` and `processor_poll_interval` (AFAIU). The query is 
faster than the previous one (see the stats below). Its cost is also negligible 
IMHO, because DAG file processing already executes many other queries (DAG runs 
and task instances are created, task instance dependencies are checked).
   
   Tested on our staging environment (patch applied to Airflow 1.10.3): zombie 
detection works fine and database load is unchanged. Here are some stats from 
`pg_stat_statements` after the branch ran there for 4 hours. The new query (1st 
row) is faster but is likely called more frequently. The 2nd row shows the 
stats of the old query.
   ```
   select calls, mean_time, max_time, rows from pg_stat_statements
   where query like '%task_instance JOIN job%' and query like '%latest_heartbeat%';
     calls   |     mean_time      |  max_time   | rows 
   ----------+--------------------+-------------+------
       55416 | 0.0260821553522449 |    5.509762 |   29
    71969011 |  0.575755060854888 | 1078.895322 | 2377
   ```
   
   Closed https://github.com/apache/airflow/pull/5420 in favour of this.
   
   ### Tests
   
   - [X] My PR adds the following unit tests
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
     - All the public functions and the classes in the PR contain docstrings 
that explain what they do
     - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Zombie detection and killing is not deterministic
> -------------------------------------------------
>
>                 Key: AIRFLOW-4797
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4797
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.3
>            Reporter: Stefan Seelmann
>            Assignee: Stefan Seelmann
>            Priority: Major
>
> Zombie detection and killing is done within the DAG file processing loop. 
> Within one iteration only a subset of the DAG files is processed (config 
> scheduler.max_threads). The loop sleeps for the rest of the second until the 
> next iteration, which processes the next subset of DAG files. The 
> function to get zombie task instances returns zombies only once every 10 
> seconds; otherwise an empty list is returned.
> That means zombies are detected only in every 10th iteration of the DAG file 
> processing loop, and they are killed only if the zombie tasks belong to one 
> of the DAG files of the current iteration.
> We ran into the worst-case scenario with max_threads=2 and 20 DAGs. In such a 
> scenario only zombies of the same 2 DAGs are killed. (As loop iterations are 
> not exactly 1s, it shifts slowly and eventually the zombies are killed, but in 
> one example it took 33 minutes.)
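
The worst case described above can be reproduced with a small simulation. This is a simplified model, not scheduler code: it assumes iterations take exactly one second, DAG files are processed round-robin in batches of `max_threads`, and zombies are returned only in every 10th iteration. The function name and parameters are hypothetical.

```python
def files_checked_for_zombies(num_files, max_threads, zombie_every, iterations):
    """Return the DAG file indices processed in an iteration that got zombies.

    Simplified model (assumption): files are processed round-robin in
    batches of `max_threads`, and the zombie list is non-empty only in
    every `zombie_every`-th iteration.
    """
    checked = set()
    position = 0
    for i in range(iterations):
        # Files handled in this loop iteration.
        batch = [(position + k) % num_files for k in range(max_threads)]
        position = (position + max_threads) % num_files
        # Zombies are only available in every `zombie_every`-th iteration.
        if i % zombie_every == 0:
            checked.update(batch)
    return checked


# 20 DAG files, 2 per iteration: a full pass takes exactly 10 iterations,
# so the zombie-carrying iteration always hits the same two files.
worst_case = files_checked_for_zombies(20, 2, 10, 1000)
print(sorted(worst_case))  # [0, 1]

# With 21 files the batches drift, and every file is eventually covered.
drifting = files_checked_for_zombies(21, 2, 10, 1000)
print(len(drifting))  # 21
```

In this model the starvation comes from the file count divided by `max_threads` matching the zombie interval, which is why timing jitter in the real loop eventually breaks the pattern.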



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
