[ 
https://issues.apache.org/jira/browse/AIRFLOW-5171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998489#comment-16998489
 ] 

jack commented on AIRFLOW-5171:
-------------------------------

I can confirm this bug also exists in 1.10.3 + LocalExecutor.
Sadly, for us it also seems to be random in nature.
I have not yet been able to find any common ground, let alone an example to
reproduce :(
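In the meantime, one way we watch for the symptom is to poll the metadata database for task instances that have sat in `queued` past some threshold. The sketch below is a hypothetical illustration, not Airflow code: it uses an in-memory SQLite table standing in for the real metadata DB, with column names mirroring Airflow 1.10's `task_instance` schema (`dag_id`, `task_id`, `state`, `queued_dttm`); the 30-minute threshold is an assumption to tune.

```python
# Hypothetical sketch: flag task instances stuck in "queued" longer than a
# threshold. An in-memory SQLite DB stands in for the Airflow metadata
# database; column names mirror the 1.10 task_instance schema.
import sqlite3
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=30)  # assumed threshold, tune as needed

def find_stuck_queued(conn, now):
    """Return (dag_id, task_id) pairs queued longer than STUCK_AFTER."""
    cutoff = (now - STUCK_AFTER).isoformat(" ")
    rows = conn.execute(
        "SELECT dag_id, task_id FROM task_instance "
        "WHERE state = 'queued' AND queued_dttm < ?",
        (cutoff,),
    )
    return [tuple(r) for r in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    "dag_id TEXT, task_id TEXT, state TEXT, queued_dttm TEXT)"
)
now = datetime(2019, 12, 17, 12, 0, 0)
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    [
        # queued two hours ago -> should be flagged
        ("dag_a", "stuck_task", "queued",
         (now - timedelta(hours=2)).isoformat(" ")),
        # queued five minutes ago -> still within the threshold
        ("dag_a", "fresh_task", "queued",
         (now - timedelta(minutes=5)).isoformat(" ")),
        # already finished -> ignored
        ("dag_b", "done_task", "success",
         (now - timedelta(hours=3)).isoformat(" ")),
    ],
)
print(find_stuck_queued(conn, now))  # only the long-queued task
```

Against a real deployment you would point the same query at the actual metadata DB connection; anything it flags can then be cleared or marked failed by hand, as the report describes.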

> Random task gets stuck in queued state despite all dependencies met
> -------------------------------------------------------------------
>
>                 Key: AIRFLOW-5171
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5171
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executors, scheduler
>    Affects Versions: 1.10.2
>            Reporter: Matt C. Wilson
>            Priority: Major
>         Attachments: Airflow - Log.png, Airflow - Task Instance Details.htm
>
>
> We are experiencing an issue similar to that reported in AIRFLOW-1641 and 
> AIRFLOW-4586.  We run two parallel dags, both using a common set of pools, 
> both using LocalExecutor.
> What we are seeing is once every couple dozen dag runs, a task will reach the 
> `queued` status and not continue into a `running` state once a pool slot is 
> open / dependencies are filled.
> Investigating the task instance details confirms the same; Airflow reports 
> that it expects the task to commence shortly once resources are available.  
> See attachment. [^Airflow - Task Instance Details.htm]
> While tasks are in this state, the sibling parallel dag is able to run to 
> completion, even multiple times through.  So we know the issue is not with 
> pool constraints, executor issues, etc.  The problem really seems to be that 
> Airflow has simply lost track of the task and failed to start it.
> Clearing the task state has no effect - the task does not get moved back into 
> a `scheduled` or `queued` or `running` state, it just stays at the `none` 
> state.  The task must be marked as `failed` or `success` to resume normal dag 
> flow.
> This issue has been causing sporadic production degradation for us, with no 
> obvious avenue for troubleshooting.  It's not clear if changing the 
> `dagbag_import_timeout` (as reported in 1641) will help because our task has 
> no log showing in the Airflow UI.   See screenshot.   !Airflow - Log.png!
> I'm open to all recommendations to try to get to the bottom of this.  Please 
> let me know if there is any log data or other info I can provide.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
