[ 
https://issues.apache.org/jira/browse/AIRFLOW-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhen Zhang updated AIRFLOW-1329:
--------------------------------
    Description: 
We see this weird issue in our production airflow cluster:

# User has a problematic import statement in a DAG definition.
# For reasons still unknown, our scheduler and workers have different 
PYTHONPATH settings, such that the scheduler parses the DAG successfully 
but the workers fail on the import (a sketch of such a DAG follows this list).
# What we observed is that, on the worker side, all the tasks in the 
problematic DAG stay in the "queued" state, while on the scheduler side the 
scheduler keeps requeueing hundreds of thousands of duplicate tasks. As a result, 
it quickly saturates the worker queue and blocks normal tasks from running. 
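
For illustration, here is a minimal sketch of the failure mode; the module name 
company_utils, the file path, and the task are hypothetical. The top-level import 
resolves on the scheduler (whose PYTHONPATH happens to include the package) but 
raises ImportError on the workers, so every task instance the scheduler queues dies 
at import time on the worker side:

{code:python}
# dags/problematic_import_example.py -- hypothetical DAG illustrating the failure.
# `company_utils` lives in a directory that is on the scheduler's PYTHONPATH
# but not on the workers', so parsing succeeds only on the scheduler.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

import company_utils  # resolves on the scheduler, raises ImportError on the workers

dag = DAG(
    dag_id='problematic_import_example',
    start_date=datetime(2017, 6, 1),
    schedule_interval='@daily',
)

run_report = PythonOperator(
    task_id='run_report',
    python_callable=company_utils.run_report,  # never reached on the workers
    dag=dag,
)
{code}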

I think a better way to handle this would be to either mark the user's task as 
failed, or have the scheduler rate-limit its requeueing of tasks, so that the 
cluster is left unaffected by user errors like this.
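
To make the rate-limit idea concrete, here is a purely illustrative sketch (not 
actual Airflow scheduler code; all names are made up): cap how many times the same 
task instance may be put back on the queue, and once the cap is hit, stop 
requeueing it (and ideally mark it failed) instead of flooding the queue with 
duplicates.

{code:python}
# Illustrative sketch only -- not actual Airflow scheduler code. All names
# (MAX_REQUEUE_ATTEMPTS, requeue_counts, maybe_requeue) are hypothetical.
MAX_REQUEUE_ATTEMPTS = 5   # hypothetical cap on requeue attempts
requeue_counts = {}        # (dag_id, task_id, execution_date) -> attempts so far

def maybe_requeue(task_key, queue):
    """Requeue task_key at most MAX_REQUEUE_ATTEMPTS times, then give up."""
    attempts = requeue_counts.get(task_key, 0)
    if attempts >= MAX_REQUEUE_ATTEMPTS:
        # Stop flooding the queue with duplicates; a real fix would also
        # mark the task instance failed so the user sees the import error.
        return False
    requeue_counts[task_key] = attempts + 1
    queue.append(task_key)
    return True

# Usage: with a problematic DAG, the task stops being requeued after 5 attempts
# instead of piling up hundreds of thousands of duplicate queue entries.
{code}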
 



> Problematic DAG causes worker queue saturation
> ----------------------------------------------
>
>                 Key: AIRFLOW-1329
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1329
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>            Reporter: Zhen Zhang
>



