Matt C. Wilson created AIRFLOW-5171:
---------------------------------------

             Summary: Random task gets stuck in queued state despite all 
dependencies met
                 Key: AIRFLOW-5171
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5171
             Project: Apache Airflow
          Issue Type: Bug
          Components: executors, scheduler
    Affects Versions: 1.10.2
            Reporter: Matt C. Wilson
         Attachments: Airflow - Log.png, Airflow - Task Instance Details.htm

We are experiencing an issue similar to those reported in AIRFLOW-1641 and 
AIRFLOW-4586.  We run two DAGs in parallel; both share a common set of pools 
and both use LocalExecutor.

Roughly once every couple dozen DAG runs, a task reaches the `queued` state 
and never moves on to `running`, even after a pool slot opens and all 
dependencies are met.

Investigating the task instance details confirms the same; Airflow reports that 
it expects the task to commence shortly once resources are available.  See 
attachment. [^Airflow - Task Instance Details.htm]

While a task is stuck in this state, the sibling parallel DAG runs to 
completion, sometimes several times over.  This suggests the issue is not pool 
exhaustion or an executor-wide failure.  The problem really seems to be that 
Airflow has simply lost track of the task and failed to start it.
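
One way to spot such tasks is to look for rows that have sat in `queued` 
longer than expected.  The sketch below is only an illustration: it uses an 
in-memory SQLite table as a stand-in for the metadata database, and the 
simplified `task_instance` columns, the helper names `find_stuck_queued` and 
`build_demo_db`, and the 30-minute threshold are all assumptions, not Airflow 
APIs.

```python
import sqlite3
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # fixed-width, so string comparison orders correctly


def find_stuck_queued(conn, now, max_queued=timedelta(minutes=30)):
    """Return (dag_id, task_id) pairs that have been queued longer than max_queued."""
    cutoff = (now - max_queued).strftime(TS_FORMAT)
    rows = conn.execute(
        "SELECT dag_id, task_id FROM task_instance "
        "WHERE state = 'queued' AND queued_dttm < ? "
        "ORDER BY dag_id, task_id",
        (cutoff,),
    ).fetchall()
    return [(dag_id, task_id) for dag_id, task_id in rows]


def build_demo_db(now):
    """Create a hypothetical, simplified task_instance table with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance "
        "(dag_id TEXT, task_id TEXT, state TEXT, queued_dttm TEXT)"
    )
    sample = [
        # queued two hours ago: should be reported as stuck
        ("dag_a", "load", "queued", now - timedelta(hours=2)),
        # queued five minutes ago: still within the threshold
        ("dag_a", "extract", "queued", now - timedelta(minutes=5)),
        # already running: not of interest here
        ("dag_b", "load", "running", now - timedelta(hours=3)),
    ]
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
        [(d, t, s, q.strftime(TS_FORMAT)) for d, t, s, q in sample],
    )
    return conn


if __name__ == "__main__":
    now = datetime(2019, 8, 9, 12, 0, 0)
    conn = build_demo_db(now)
    for dag_id, task_id in find_stuck_queued(conn, now):
        print(dag_id, task_id)
```

Against a real deployment the same query would have to be adapted to the 
actual metadata schema and database backend, which this sketch does not 
attempt to reproduce.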

Clearing the task state does not help: the task is not moved back into a 
`scheduled`, `queued`, or `running` state; it just stays at the `none` state.  
The task must be marked as `failed` or `success` to resume normal DAG flow.

This issue has been causing sporadic production degradation for us, with no 
obvious avenue for troubleshooting.  It's not clear whether changing 
`dagbag_import_timeout` (as suggested in AIRFLOW-1641) will help, because the 
stuck task shows no log at all in the Airflow UI.  See screenshot.  
!Airflow - Log.png!

I'm open to all recommendations to try to get to the bottom of this.  Please 
let me know if there is any log data or other info I can provide.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
