[ https://issues.apache.org/jira/browse/AIRFLOW-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876365#comment-16876365 ]
ASF GitHub Bot commented on AIRFLOW-4797:
-----------------------------------------

seelmann commented on pull request #5511: [AIRFLOW-4797] Fix zombie detection
URL: https://github.com/apache/airflow/pull/5511

### Jira

- [X] My PR addresses the following [Airflow Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references them in the PR title.
  - https://issues.apache.org/jira/browse/AIRFLOW-4797

### Description

- [X] Here are some details about my PR, including screenshots of any UI changes:

Moved the query that fetches zombies from `DagFileProcessorManager` to the `DagBag` class. Changed the query to only look for DAGs of the current DAG bag. The query now uses the index `ti_dag_state` instead of `ti_state`. Removed the no longer required `zombies` parameters from many function signatures.

The query is now executed on every call to `DagBag.kill_zombies`, which is called whenever the DAG file is processed. How often that happens depends on `scheduler_heartbeat_sec` and `processor_poll_interval` (AFAIU). The query is faster than the previous one (see the stats below). Its cost is also negligible IMHO, because many other queries are executed during DAG file processing (DAG runs and task instances are created, task instance dependencies are checked).

Tested on our staging environment (patch applied to Airflow 1.10.3): zombie detection works fine and database load is unchanged. Here are some stats from `pg_stat_statements`; the branch ran there for 4 hours. The new query (1st line) is faster but is likely called more frequently. The 2nd line shows the stats of the old query.

```
select calls,mean_time,max_time,rows from pg_stat_statements where query like '%task_instance JOIN job%' and query like '%latest_heartbeat%';

  calls   |     mean_time      |  max_time   | rows
----------+--------------------+-------------+------
    55416 | 0.0260821553522449 |    5.509762 |   29
 71969011 |  0.575755060854888 | 1078.895322 | 2377
```

Closed https://github.com/apache/airflow/pull/5420 in favour of this.
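To illustrate the idea of restricting zombie detection to the DAGs of the current DAG bag, here is a minimal, database-free sketch. It is not the code from the PR: `RunningTask`, `find_zombies`, and the exact zombie condition (task still marked running while its job is no longer running or its heartbeat is older than a threshold) are assumptions made for illustration. The description only states that the query joins `task_instance` with `job`, involves `latest_heartbeat`, and is limited to the current bag's DAGs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass
class RunningTask:
    """Hypothetical stand-in for one row of a task_instance/job join."""
    dag_id: str
    task_id: str
    ti_state: str
    job_state: str
    latest_heartbeat: datetime


def find_zombies(rows: Iterable[RunningTask], bag_dag_ids: Set[str],
                 now: datetime, threshold_secs: int) -> List[RunningTask]:
    """Return rows that look like zombies, limited to DAGs in the current bag.

    Assumed condition: the task instance is still 'running', but its job is
    either not running any more or has not sent a heartbeat recently.
    """
    limit = now - timedelta(seconds=threshold_secs)
    return [
        row for row in rows
        if row.dag_id in bag_dag_ids            # only DAGs of this DAG bag
        and row.ti_state == "running"
        and (row.job_state != "running" or row.latest_heartbeat < limit)
    ]
```

Filtering by DAG id first is what would let a database use an index keyed on DAG and state rather than on state alone, which matches the index change mentioned in the description.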

### Tests

- [X] My PR adds the following unit tests

### Commits

- [X] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation

- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain docstrings that explain what they do
  - If you implement backwards incompatible changes, please leave a note in the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so we can assign it to an appropriate release

### Code Quality

- [X] Passes `flake8`

> Zombie detection and killing is not deterministic
> -------------------------------------------------
>
>                 Key: AIRFLOW-4797
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4797
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.3
>            Reporter: Stefan Seelmann
>            Assignee: Stefan Seelmann
>            Priority: Major
>
> Zombie detection and killing is done within the DAG file processing loop.
> Within one iteration only a subset of the DAG files is processed (config
> scheduler.max_threads).
> The loop sleeps for the rest of the second, until the next iteration
> runs and processes the next subset of DAG files. The function to get
> zombie task instances returns zombies only once within 10 seconds;
> otherwise an empty list is returned.
> That means zombies are detected only in every 10th iteration of the DAG
> file processing loop, and they are killed only if the zombie tasks belong
> to one of the DAG files of the current iteration.
> We ran into the worst-case scenario with max_threads=2 and 20 DAGs. In such
> a scenario only zombies of the same 2 DAGs are killed. (As loop iterations
> are not exactly 1s, it shifts slowly and eventually the zombies are killed,
> but in one example it took 33 minutes.)
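
The worst case described in the issue can be reproduced with a small simulation. This is a sketch under simplifying assumptions, not scheduler code: DAG files are assumed to be processed round-robin in fixed batches of `max_threads`, each loop iteration is assumed to take exactly one second, and zombies are assumed to be returned on exactly every `detect_every`-th iteration. The function and parameter names are hypothetical.

```python
from typing import Set


def dags_checked_for_zombies(num_dags: int, max_threads: int,
                             detect_every: int, iterations: int) -> Set[int]:
    """Return the indices of DAGs whose zombies could ever be killed."""
    checked: Set[int] = set()
    for i in range(iterations):
        # Each iteration processes the next batch of DAG files, round-robin.
        start = (i * max_threads) % num_dags
        batch = [(start + k) % num_dags for k in range(max_threads)]
        # Zombies are only returned on every `detect_every`-th iteration,
        # so only the batch processed in that iteration benefits.
        if i % detect_every == 0:
            checked.update(batch)
    return checked


# 20 DAGs, 2 per iteration, detection every 10th iteration:
# the detecting iteration always coincides with the same batch.
stuck = dags_checked_for_zombies(20, 2, 10, 1000)

# If detection drifts to every 11th iteration, all DAGs get covered over time.
drifting = dags_checked_for_zombies(20, 2, 11, 1000)
```

With 20 DAGs and batches of 2, a full pass over all files takes 10 iterations, which is the same period as zombie detection, so `stuck` contains only the first two DAGs. Any drift in the period, as in `drifting`, eventually covers every DAG, which is consistent with the slow shift reported in the issue.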