[ https://issues.apache.org/jira/browse/AIRFLOW-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891343#comment-15891343 ]

Vijay Krishna Ramesh edited comment on AIRFLOW-931 at 3/2/17 12:06 AM:
-----------------------------------------------------------------------

[~bolke] I added a bunch of manual logging to verify this. What seems to happen 
(in my simple gist example) is that the DAG starts with 4 tasks. The multiple 
scheduler processes kick them all off (at jobs.py:1095) because at that time 
they are all runnable (since none are yet running). Then the check at 
models.py:1291 verifies each task is actually runnable (2 of them are, but due 
to the dag concurrency of 2, the third and fourth aren't). Those 2 tasks stay 
marked as queued but never get picked up by the scheduler again (unless you 
restart it). Is there some other state that the models.py:1291 check could 
move the tasks to if that last-minute check finds they aren't actually 
runnable? (I found that setting them to State.NONE worked, but that seems 
hackish, as they keep bouncing back and forth between QUEUED and NONE until 
they can actually run.)



> LocalExecutor fails to run queued task with race condition
> ----------------------------------------------------------
>
>                 Key: AIRFLOW-931
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-931
>             Project: Apache Airflow
>          Issue Type: Bug
>    Affects Versions: Airflow 1.8, 1.8.0rc4
>            Reporter: Vijay Krishna Ramesh
>
> https://gist.github.com/vijaykramesh/707262c83429ab2a3f5ee701879813e3 
> provides a small example that consistently hits this problem with 
> LocalExecutor.
> Basically, when the dag run kicks off (with dag concurrency > 1) under a 
> LocalExecutor with parallelism > 2, the scheduler marks more than concurrency 
> tasks as queued 
> (https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L1095).
> There is a second check before actually running each task 
> (https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L1291)
>  that leaves the task in the QUEUED state, but the scheduler never picks 
> it back up. This causes the DAG to get stuck (as the queued tasks never run) 
> until the scheduler is restarted (at which point the queued tasks are 
> considered orphaned, their status is set to NONE, and they are picked up 
> by the scheduler again and run).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
