[ 
https://issues.apache.org/jira/browse/AIRFLOW-5171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt C. Wilson updated AIRFLOW-5171:
------------------------------------
    Description: 
We are experiencing an issue similar to the ones reported in AIRFLOW-1641 and 
AIRFLOW-4586.  We run two parallel DAGs, both using a common set of pools, both 
on the LocalExecutor.

What we are seeing is that once every couple of dozen DAG runs, a task will 
reach the `queued` state and never continue into a `running` state, even once a 
pool slot is open and its dependencies are met.
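For anyone trying to reproduce or monitor this, one way to spot such tasks is 
to query the metadata database for task instances that have sat in `queued` 
past some threshold.  A minimal sketch, using an in-memory SQLite table as a 
stand-in for Airflow's `task_instance` table (the column names here are 
assumptions based on the 1.10.x schema; verify them against your own metadata 
DB before running anything like this in production):

```python
import sqlite3
from datetime import datetime, timedelta

# Stand-in for the metadata DB; a real check would point at the
# actual Airflow database instead.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task_instance (
        dag_id TEXT, task_id TEXT, execution_date TEXT,
        state TEXT, queued_dttm TEXT
    )
""")
now = datetime(2019, 8, 9, 12, 0, 0)
rows = [
    # Queued three hours ago and never started -- the symptom here.
    ("dag_a", "extract", "2019-08-09T00:00:00", "queued",
     (now - timedelta(hours=3)).isoformat()),
    # Freshly queued, nothing wrong yet.
    ("dag_b", "load", "2019-08-09T00:00:00", "queued",
     (now - timedelta(minutes=2)).isoformat()),
    ("dag_a", "load", "2019-08-09T00:00:00", "success", None),
]
conn.executemany("INSERT INTO task_instance VALUES (?,?,?,?,?)", rows)

# Flag anything queued for more than 30 minutes as suspect.
cutoff = (now - timedelta(minutes=30)).isoformat()
stuck = conn.execute(
    "SELECT dag_id, task_id, execution_date FROM task_instance "
    "WHERE state = 'queued' AND queued_dttm < ?", (cutoff,)
).fetchall()
print(stuck)  # [('dag_a', 'extract', '2019-08-09T00:00:00')]
```

The 30-minute cutoff is arbitrary; anything comfortably longer than your 
normal queue wait would do.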

Investigating the task instance details confirms this; Airflow reports that it 
expects the task to commence shortly once resources are available.  See the 
attachment. [^Airflow - Task Instance Details.htm]

While tasks are in this state, the sibling parallel DAG is able to flow through 
completely, even multiple times.  So we know the issue is not with pool 
constraints, executor problems, etc.  The problem really seems to be that 
Airflow has simply lost track of the task and failed to start it.

Clearing the task state has no effect: the task does not move back into a 
`scheduled`, `queued`, or `running` state; it just stays in the `none` state.  
The task must be marked `failed` or `success` to resume normal DAG flow.
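Since marking the task `failed` or `success` is the only thing that unblocks 
the run, the workaround amounts to forcing the state, either via the UI's 
"Mark Success"/"Mark Failed" actions or by updating the row directly.  A 
hedged sketch of the direct-update variant, again against an in-memory SQLite 
stand-in rather than a real metadata DB (column names assumed from the 1.10.x 
`task_instance` schema):

```python
import sqlite3

# Stand-in for the metadata DB; the real workaround would run the
# equivalent UPDATE against Airflow's task_instance table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance (dag_id TEXT, task_id TEXT, "
    "execution_date TEXT, state TEXT)"
)
# NULL state mirrors the stuck `none` state described above.
conn.execute(
    "INSERT INTO task_instance VALUES "
    "('dag_a', 'extract', '2019-08-09T00:00:00', NULL)"
)

# Force the stuck instance to 'failed' so downstream tasks and
# trigger rules can resume normal DAG flow.
conn.execute(
    "UPDATE task_instance SET state = 'failed' "
    "WHERE dag_id = ? AND task_id = ? AND execution_date = ? "
    "AND state IS NULL",
    ("dag_a", "extract", "2019-08-09T00:00:00"),
)
state = conn.execute(
    "SELECT state FROM task_instance WHERE task_id = 'extract'"
).fetchone()[0]
print(state)  # failed
```

Scoping the UPDATE to the exact dag_id/task_id/execution_date and the NULL 
state keeps it from clobbering healthy task instances.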

This issue has been causing sporadic production degradation for us, with no 
obvious avenue for troubleshooting.  It's not clear whether changing 
`dagbag_import_timeout` (as reported in AIRFLOW-1641) would help, because our 
task has no log showing in the Airflow UI.  See screenshot.

I'm open to any recommendations for getting to the bottom of this.  Please let 
me know if there is any log data or other info I can provide.

 

  was:
We are experiencing an issue similar to that reported in AIRFLOW-1641 and 
AIRFLOW-4586.  We run two parallel dags, both using a common set of pools, both 
using LocalExecutor.

What we are seeing is once every couple dozen dag runs, a task will reach the 
`queued` status and not continue into a `running` state once a pool slot is 
open / dependencies are filled.

Investigating the task instance details confirms the same; Airflow reports that 
it expects the task to commence shortly once resources are available.  See 
attachment. [^Airflow - Task Instance Details.htm]

While tasks are in this state, the sibling parallel dag is able to flow 
completely, even multiple times through.  So we know the issue is not with pool 
constraints, executor issues, etc.  The problem really seems to be that Airflow 
has simply lost track of the task and failed to start it.

Clearing the task state has no effect - the task does not get moved back into a 
`scheduled` or `queued` or `running` state, it just stays at the `none` state.  
The task must be marked as `failed` or `success` to resume normal dag flow.

This issue has been causing sporadic production degradation for us, with no 
obvious avenue for troubleshooting.  It's not clear if changing the 
`dagbag_import_timeout` (as reported in 1641) will help because our task has no 
log showing in the Airflow UI.   See screenshot.   !Airflow - Log.png!

I'm open to all recommendations to try to get to the bottom of this.  Please 
let me know if there is any log data or other info I can provide.

 


> Random task gets stuck in queued state despite all dependencies met
> -------------------------------------------------------------------
>
>                 Key: AIRFLOW-5171
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5171
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executors, scheduler
>    Affects Versions: 1.10.2
>            Reporter: Matt C. Wilson
>            Priority: Major
>         Attachments: Airflow - Task Instance Details.htm
>
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
