[
https://issues.apache.org/jira/browse/AIRFLOW-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhen Zhang updated AIRFLOW-1329:
--------------------------------
Description:
We see this weird issue in our production Airflow cluster:
# A user has a problematic import statement in a DAG definition (see the
sketch after this list).
# For reasons still unknown, our scheduler and workers have different
PYTHONPATH settings, so the scheduler is able to parse the DAG
successfully but the workers fail on import.
# What we observed is that, on the worker side, all the tasks in the
problematic DAG sit in the "queued" state, while on the scheduler side, the
scheduler keeps requeueing hundreds of thousands of duplicated tasks. As a result,
it quickly saturates the worker queue and blocks normal tasks from running.
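To illustrate the first point, below is a hypothetical DAG file of the kind
that triggers this. The internal module name (team_utils) is made up for
illustration; the point is that the import resolves on the scheduler's
PYTHONPATH but raises ImportError on the workers.
{code:python}
# Hypothetical DAG file -- dags/pythonpath_mismatch_example.py
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# This import only resolves where PYTHONPATH includes our internal package.
# It works on the scheduler, so the DAG is parsed and tasks get queued,
# but it raises ImportError on the workers, so no task ever actually runs.
from team_utils import do_work  # hypothetical internal module

dag = DAG(
    dag_id='pythonpath_mismatch_example',
    start_date=datetime(2017, 6, 1),
    schedule_interval='@daily',
)

run_it = PythonOperator(
    task_id='run_it',
    python_callable=do_work,
    dag=dag,
)
{code}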
I think a better way to handle this would be to either mark the user's task as
failed, or have the scheduler rate-limit requeueing, leaving the
cluster unaffected by user errors like this.
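As a rough sketch of the rate-limit idea (this is not the scheduler's actual
code path; MAX_REQUEUES, requeue_counts and should_requeue are invented names):
the scheduler could count how many times a given task instance has been pushed
to the queue without leaving the "queued" state, and mark it failed once a
threshold is exceeded instead of requeueing it indefinitely.
{code:python}
# Conceptual sketch only -- not actual Airflow scheduler code.
from collections import defaultdict

MAX_REQUEUES = 10  # hypothetical threshold

# Keyed by (dag_id, task_id, execution_date)
requeue_counts = defaultdict(int)

def should_requeue(ti_key):
    """Return True if this task instance may be sent to the queue again.

    Once the threshold is exceeded, the caller should mark the task
    instance FAILED instead of queueing it yet another time, so a broken
    DAG cannot saturate the worker queue.
    """
    requeue_counts[ti_key] += 1
    return requeue_counts[ti_key] <= MAX_REQUEUES
{code}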
was:
We see this weird issue in our production Airflow cluster:
# A user has a problematic import statement in a DAG definition.
# For reasons still unknown, our scheduler and workers have different
PYTHONPATH settings, so the scheduler is able to parse the DAG
successfully but the workers fail on import.
# What we observed is that, on the worker side, all the tasks in the
problematic DAG sit in the "queued" state, while on the scheduler side, the
scheduler keeps requeueing hundreds of thousands of tasks. As a result, it quickly
saturates the worker queue and blocks normal tasks from running.
I think a better way to handle this would be to either mark the user's task as
failed, or have the scheduler rate-limit requeueing, leaving the
cluster unaffected by user errors like this.
> Problematic DAG causes worker queue saturation
> ----------------------------------------------
>
> Key: AIRFLOW-1329
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1329
> Project: Apache Airflow
> Issue Type: Improvement
> Components: scheduler
> Reporter: Zhen Zhang
>
> We see this weird issue in our production Airflow cluster:
> # A user has a problematic import statement in a DAG definition.
> # For reasons still unknown, our scheduler and workers have different
> PYTHONPATH settings, so the scheduler is able to parse the DAG
> successfully but the workers fail on import.
> # What we observed is that, on the worker side, all the tasks in the
> problematic DAG sit in the "queued" state, while on the scheduler side, the
> scheduler keeps requeueing hundreds of thousands of duplicated tasks. As a
> result, it quickly saturates the worker queue and blocks normal tasks from running.
> I think a better way to handle this would be to either mark the user's task as
> failed, or have the scheduler rate-limit requeueing, leaving
> the cluster unaffected by user errors like this.
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)