[
https://issues.apache.org/jira/browse/AIRFLOW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523708#comment-16523708
]
James Meickle commented on AIRFLOW-1463:
----------------------------------------
We ran into this in production last night. Our worker instance ran out of memory;
we suspect that it pulled messages from Celery but then could not fork new
worker processes. This left the task missing from Celery while the scheduler
still believed it was there.
I would have expected this check to result in the
`SCHEDULED`-but-missing-from-Celery tasks eventually getting reset:
[https://github.com/apache/incubator-airflow/blob/1.9.0/airflow/jobs.py#L213]
But it looks like this only runs on scheduler startup, and not periodically?
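To make the expectation concrete, here is a rough sketch of the kind of periodic check I have in mind. It is not Airflow's actual code: `TaskRecord`, `reset_orphaned`, and the state names are hypothetical stand-ins, and it assumes the scheduler can obtain the set of task keys the executor currently knows about.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class TaskRecord:
    """Hypothetical stand-in for a task instance row in the metadata DB."""
    key: str
    state: Optional[str]  # e.g. "scheduled", "queued", "running", or None


def reset_orphaned(
    tasks: Iterable[TaskRecord],
    executor_keys: Set[str],
    orphan_states=("scheduled", "queued"),
) -> List[str]:
    """Clear the state of tasks that look pending to the scheduler
    but are unknown to the executor. Returns the keys that were reset."""
    reset = []
    for task in tasks:
        if task.state in orphan_states and task.key not in executor_keys:
            # No state means the scheduler is free to schedule the task again.
            task.state = None
            reset.append(task.key)
    return reset


# Example: "a" is queued in the DB but the executor has never heard of it.
tasks = [
    TaskRecord("a", "queued"),
    TaskRecord("b", "queued"),
    TaskRecord("c", "running"),
]
orphans = reset_orphaned(tasks, executor_keys={"b"})
```

The point is only that a check like this would need to run on an interval inside the scheduler loop, not once at startup, to recover from the situation described above.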
> Scheduler does not reschedule tasks in QUEUED state
> ---------------------------------------------------
>
> Key: AIRFLOW-1463
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1463
> Project: Apache Airflow
> Issue Type: Improvement
> Components: cli
> Environment: Ubuntu 14.04
> Airflow 1.8.0
> SQS backed task queue, AWS RDS backed meta storage
> DAG folder is synced by script on code push: archive is downloaded from s3,
> unpacked, moved, install script is run. airflow executable is replaced with
> symlink pointing to the latest version of code, no airflow processes are
> restarted.
> Reporter: Stanislav Pak
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Our pipeline-related code is deployed almost simultaneously on all Airflow
> boxes: the scheduler+webserver box and the worker boxes. A common Python
> package is deployed on those boxes on every other code push (3-5 deployments
> per hour). Due to installation specifics, a DAG that imports a module from
> that package might fail to import. If the DAG import fails when a worker runs
> a task, the task is still removed from the queue but its state is not
> changed, so the task stays in the QUEUED state forever.
> Besides the case described above, this can also happen because of DAG update
> lag in the scheduler: a task can be scheduled with the old DAG, and a worker
> can then run the task with a new DAG that fails to import.
> There might be other scenarios in which this happens.
> Proposal:
> Catch errors when importing the DAG on task run and clear the task instance
> state if the import fails. This should fix transient issues of this kind.
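
The proposal quoted above could be sketched roughly as follows. This is an illustration under stated assumptions, not Airflow's worker code: `run_task`, the dict-based task, and the `execute` callback are hypothetical, and a DAG is modelled as a plain importable Python module.

```python
import importlib


def run_task(task: dict, dag_module_name: str, execute) -> bool:
    """Import the DAG module and run the task.

    If the import fails, clear the task state instead of leaving it QUEUED,
    so that a scheduler could pick the task up again later.
    Returns True if the task ran, False if the import failed.
    """
    try:
        module = importlib.import_module(dag_module_name)
    except Exception:
        # Module-level DAG code can raise anything, not only ImportError.
        task["state"] = None
        return False

    task["state"] = "running"
    execute(module, task)
    task["state"] = "success"
    return True


# Example: a module that cannot be imported leaves the task with no state.
task = {"id": "example", "state": "queued"}
ran = run_task(task, "dag_module_that_does_not_exist", lambda module, t: None)
```

Clearing the state rather than marking the task failed matches the proposal's intent: a transient import error during a deployment should lead to a retry, not a permanent failure.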