Zhen Zhang created AIRFLOW-1329:
-----------------------------------

             Summary: Problematic DAG causes worker queue saturation
                 Key: AIRFLOW-1329
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1329
             Project: Apache Airflow
          Issue Type: Improvement
          Components: scheduler
            Reporter: Zhen Zhang


We are seeing a weird issue in our production Airflow cluster:

# A user has a problematic import statement in a DAG definition (a minimal sketch of such a DAG follows this list).
# For reasons still unknown, our scheduler and workers have different PYTHONPATH settings, such that the scheduler is able to parse the DAG successfully, but the workers fail on the import.
# What we observed is that, on the worker side, all the tasks in the problematic DAG sit in the "queued" state, while on the scheduler side, the scheduler keeps requeueing hundreds of thousands of tasks. As a result, it quickly saturates the worker queue and blocks normal tasks from running.
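
For illustration, here is a minimal sketch of such a DAG (step 1). The module name internal_utils, the dag/task ids, and the dates are hypothetical; the point is only that the import resolves on the scheduler's PYTHONPATH but raises ImportError on the workers:

{code:python}
# dags/example_report.py -- hypothetical DAG reproducing the failure mode
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical module that exists only on the scheduler's PYTHONPATH.
# The scheduler parses this file fine; the workers raise ImportError here.
import internal_utils

dag = DAG(
    dag_id='example_report',
    start_date=datetime(2017, 6, 1),
    schedule_interval='@daily',
)

def build_report():
    internal_utils.run()  # never reached on the workers; the import fails first

PythonOperator(
    task_id='build_report',
    python_callable=build_report,
    dag=dag,
)
{code}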

I think a better way to handle this would be to either mark the user's task as failed, or have the scheduler rate-limit the requeueing of tasks, leaving the cluster unaffected by user errors like this.
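
A rough sketch of the rate-limit idea, assuming a simplified scheduler loop; MAX_REQUEUES, the task key tuple, and the two callbacks are hypothetical, not actual Airflow internals:

{code:python}
# Minimal sketch of the proposed rate limit, assuming a simplified scheduler
# loop. MAX_REQUEUES, the task key, and the two callbacks are hypothetical.
from collections import defaultdict

MAX_REQUEUES = 3  # hypothetical cap on requeue attempts per task instance

requeue_counts = defaultdict(int)

def maybe_requeue(task_key, send_to_queue, mark_failed):
    """Requeue a task unless it has already hit the requeue cap."""
    if requeue_counts[task_key] >= MAX_REQUEUES:
        # Give up and surface the error instead of flooding the worker queue.
        mark_failed(task_key)
        return False
    requeue_counts[task_key] += 1
    send_to_queue(task_key)
    return True

# Usage sketch: after three requeues the task is failed, not requeued again.
if __name__ == '__main__':
    key = ('example_report', 'build_report', '2017-06-18')
    for _ in range(5):
        maybe_requeue(key, send_to_queue=lambda k: print('requeued', k),
                      mark_failed=lambda k: print('failed', k))
{code}

Once the cap is hit, the task is failed instead of being requeued, so the import error surfaces to the user rather than silently saturating the worker queue.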