[ 
https://issues.apache.org/jira/browse/AIRFLOW-4797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16864778#comment-16864778
 ] 

ASF GitHub Bot commented on AIRFLOW-4797:
-----------------------------------------

seelmann commented on pull request #5420: [AIRFLOW-4797] Fix zombie detection
URL: https://github.com/apache/airflow/pull/5420
 
 
   ### Jira
   
   - [X] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. 
     - https://issues.apache.org/jira/browse/AIRFLOW-4797
   
   ### Description
   
   - [X] Here are some details about my PR, including screenshots of any UI 
changes:
     - Fix zombie detection and killing by removing the condition that returned 
zombie task instances only once within 10 seconds. The method is called at most 
once per second anyway, because the loop sleeps whenever an iteration finishes 
in under one second, and the query it executes uses indexes.
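
The effect of the removed condition can be illustrated with a minimal sketch. This is not the Airflow code: `ZombieFinder`, `query`, `clock`, `min_interval` and `throttled` are hypothetical names, and the clock is injected only so the behaviour can be checked without waiting.

```python
class ZombieFinder:
    """Hypothetical stand-in for the zombie lookup (not Airflow's API)."""

    def __init__(self, query, clock, min_interval=10.0, throttled=True):
        self.query = query                # callable returning the current zombies
        self.clock = clock                # callable returning the time in seconds
        self.min_interval = min_interval  # assumed 10 s window from the description
        self.throttled = throttled        # True models the old behaviour
        self._last_query = None

    def find(self):
        now = self.clock()
        if (
            self.throttled
            and self._last_query is not None
            and now - self._last_query < self.min_interval
        ):
            # Old behaviour: callers inside the window get an empty list,
            # even if zombies exist.
            return []
        self._last_query = now
        return self.query()
```

With `throttled=True`, a second call inside the window returns `[]`. With `throttled=False`, which corresponds to dropping the condition, every call runs the query.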
   
   ### Tests
   
   - [X] My PR adds the following unit tests:
     - Adapted existing test
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
     - All the public functions and the classes in the PR contain docstrings 
that explain what they do
     - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Zombie detection and killing is not deterministic
> -------------------------------------------------
>
>                 Key: AIRFLOW-4797
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4797
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.3
>            Reporter: Stefan Seelmann
>            Assignee: Stefan Seelmann
>            Priority: Major
>
> Zombie detection and killing is done within the DAG file processing loop. 
> Within one iteration only a subset of the DAG files is processed (config 
> scheduler.max_threads). The loop sleeps for the rest of the second, until the 
> next iteration runs, which processes the next subset of DAG files. The 
> function to get zombie task instances only returns zombies once within 10 
> seconds; otherwise an empty list is returned.
> That means zombies are detected only in every 10th iteration of the DAG file 
> processing loop, and they are killed only if the zombie task belongs to one 
> of the DAG files of the current iteration.
> We ran into a very unfortunate scenario with max_threads=2 and 20 DAGs. In 
> such a scenario only zombies of the same 2 DAGs are killed. (As loop 
> iterations are not exactly 1s, it shifts slowly and eventually the zombies are 
> killed, but in one example it took 33 minutes.)
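
The scenario above can be reproduced with a small simulation. It is a simplified model under stated assumptions, not the scheduler's code: every iteration is taken to last exactly one second, DAG files are processed round-robin in batches of `max_threads`, and the function name and parameters are made up for illustration.

```python
def dags_checked_for_zombies(num_dags=20, max_threads=2,
                             zombie_interval=10, iterations=100):
    """Return the indexes of DAG files whose batch coincided with a zombie query."""
    checked = set()
    last_check = None  # simulated time (seconds) of the last zombie query
    for i in range(iterations):
        now = i  # assumption: each loop iteration takes exactly one second
        start = (i * max_threads) % num_dags
        batch = [(start + k) % num_dags for k in range(max_threads)]
        # Zombies are only returned once per `zombie_interval` seconds.
        if last_check is None or now - last_check >= zombie_interval:
            last_check = now
            checked.update(batch)
    return checked


print(sorted(dags_checked_for_zombies()))                       # [0, 1]
print(len(dags_checked_for_zombies(zombie_interval=0)))         # 20
```

With the 10 second window, the query always coincides with the batch holding DAG files 0 and 1, because 20 files at 2 per iteration also take 10 iterations per cycle. Setting `zombie_interval=0`, which models running the query on every iteration, covers all 20 files.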



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
