Zhen Zhang created AIRFLOW-1329:
-----------------------------------
Summary: Problematic DAG causes worker queue saturation
Key: AIRFLOW-1329
URL: https://issues.apache.org/jira/browse/AIRFLOW-1329
Project: Apache Airflow
Issue Type: Improvement
Components: scheduler
Reporter: Zhen Zhang
We are seeing a weird issue in our production Airflow cluster:
# A user has a problematic import statement in a DAG definition.
# For reasons still unknown, our scheduler and workers have different
PYTHONPATH settings, such that the scheduler parses the DAG successfully
but the workers fail on the import (see the sketch after this list).
# What we observed is that, on the worker side, all the tasks in the
problematic DAG stay in the "queued" state, while on the scheduler side, the
scheduler keeps requeueing hundreds of thousands of tasks. As a result, it
quickly saturates the worker queue and blocks normal tasks from running.
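
To make the failure mode concrete, here is a minimal sketch of such a DAG.
team_utils is a hypothetical module that happens to be on the scheduler's
PYTHONPATH but not on the workers', so the file parses fine during scheduling
and raises ImportError when a worker tries to load it:

{code:python}
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical module: present on the scheduler's PYTHONPATH,
# missing on the workers, so this line raises ImportError there.
import team_utils

dag = DAG(
    dag_id='pythonpath_mismatch_example',
    start_date=datetime(2017, 6, 1),
    schedule_interval='@daily',
)

run_task = PythonOperator(
    task_id='run',
    python_callable=team_utils.process,  # never reached on the workers
    dag=dag,
)
{code}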
I think a better way to handle this would be to either mark the user's tasks
as failed, or have the scheduler rate-limit task requeueing, leaving the
cluster unaffected by user errors like this.
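
For illustration, here is a generic sketch of the proposed rate limit; the
names (MAX_REQUEUES, maybe_requeue, the callbacks) are hypothetical, not
actual scheduler internals. The idea is simply to cap how many times a task
instance can be requeued and then fail it instead of flooding the queue:

{code:python}
import collections

MAX_REQUEUES = 5  # hypothetical per-task-instance cap before giving up

requeue_counts = collections.Counter()

def maybe_requeue(task_instance_key, queue_fn, fail_fn):
    """Requeue a task instance at most MAX_REQUEUES times, then fail it.

    task_instance_key: a hashable (dag_id, task_id, execution_date) tuple.
    queue_fn / fail_fn: callbacks that enqueue or fail the task instance.
    """
    if requeue_counts[task_instance_key] >= MAX_REQUEUES:
        # Give up on this task instance instead of saturating the
        # worker queue with endless requeues.
        fail_fn(task_instance_key)
        return
    requeue_counts[task_instance_key] += 1
    queue_fn(task_instance_key)
{code}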