[ 
https://issues.apache.org/jira/browse/AIRFLOW-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Hartig updated AIRFLOW-5881:
----------------------------------
    Description: 
Running with the KubernetesExecutor in and AKS cluster, when we upgraded to 
version 1.10.6 we noticed that the all the Dags stop making progress but start 
running and immediate exiting with the following message:

"Instance State' FAILED: Task is in the 'scheduled' state which is not a valid 
state for execution. The task must be cleared in order to be run."

See attached log file for the worker. Nothing seems out of the ordinary in the 
Scheduler log. 

Reverting to 1.10.5 clears the problem.

Note that at the time of the failure maybe 100 or so tasks are in this state, 
with 70 coming from one highly parallelized dag. Clearing the scheduled tasks 
just makes them reappear shortly thereafter. Marking them "up_for_retry" 
results in one being executed but then the system is stuck in the original 
zombie state. 

Attached is the also a redacted airflow config flag. 

  was:
Running with the KubernetesExecutor in and AKS cluster, when we upgraded to 
version 1.10.6 we noticed that the all the Dags stop making progress but start 
running and immediate exiting with the following message:

"Instance State' FAILED: Task is in the 'scheduled' state which is not a valid 
state for execution. The task must be cleared in order to be run."

See attached log file for the worker. Nothing seems out of the ordinary in the 
Scheduler log. 

Reverting to 1.10.5 clears the problem.

Note that at the time of the failure maybe 100 or so tasks are in this state, 
with 70 coming from one highly parallelized dag. Clearing the scheduled tasks 
just makes them reappear shortly thereafter. Marking them "up_for_retry" 
results in that one being executed but then the system is stuck in the original 
zombie state. 

Attached is the also a redacted airflow config flag. 


> Dag gets stuck in "Scheduled" State when scheduling a large number of tasks
> ---------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5881
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5881
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.10.6
>            Reporter: David Hartig
>            Priority: Critical
>         Attachments: 2 (1).log, airflow.cnf
>
>
> Running with the KubernetesExecutor in and AKS cluster, when we upgraded to 
> version 1.10.6 we noticed that the all the Dags stop making progress but 
> start running and immediate exiting with the following message:
> "Instance State' FAILED: Task is in the 'scheduled' state which is not a 
> valid state for execution. The task must be cleared in order to be run."
> See attached log file for the worker. Nothing seems out of the ordinary in 
> the Scheduler log. 
> Reverting to 1.10.5 clears the problem.
> Note that at the time of the failure maybe 100 or so tasks are in this state, 
> with 70 coming from one highly parallelized dag. Clearing the scheduled tasks 
> just makes them reappear shortly thereafter. Marking them "up_for_retry" 
> results in one being executed but then the system is stuck in the original 
> zombie state. 
> Attached is the also a redacted airflow config flag. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to