[
https://issues.apache.org/jira/browse/AIRFLOW-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987188#comment-15987188
]
Dan Davydov commented on AIRFLOW-1143:
--------------------------------------
I agree the NONE solution is OK for now, though we need to be careful: if the
"task is already in running state" dep fails, for example, we don't want to set
the state to NONE. Long term I think we want something more explicit: fixing
heartbeat so it supports this, and having the scheduler set the state instead
of the worker. The reason is that all state changes should eventually be done
by the scheduler (workers should be "dumb").
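
To make the caveat about the NONE reset concrete, here is a rough sketch of
the decision a worker would make after rejecting a task. It is not Airflow
code: the State enum, the dep names and state_after_rejection are all
hypothetical names made up for illustration, and it assumes the worker knows
which deps failed.

```python
from enum import Enum
from typing import Set


class State(Enum):
    # Simplified stand-in for task instance states (hypothetical).
    NONE = "none"
    QUEUED = "queued"
    RUNNING = "running"


# Hypothetical dep names, not real Airflow identifiers.
ALREADY_RUNNING = "task_already_running"
POOL_FULL = "pool_full"


def state_after_rejection(current: State, failed_deps: Set[str]) -> State:
    """Return the state a task should have after a worker checks its deps."""
    if not failed_deps:
        # Nothing failed, so the worker is not rejecting the task.
        return current
    if ALREADY_RUNNING in failed_deps:
        # Another process already owns the task; clearing its state
        # would clobber a live run, so leave it alone.
        return current
    # Any other failure (e.g. the pool became full): clear the state so
    # the scheduler can pick the task up again.
    return State.NONE
```

With this sketch, a QUEUED task rejected because the pool is full goes back
to NONE, while a task rejected because it is already running keeps its
current state.
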
I don't see that message "FIXME: Rescheduling due to concurrency limits", but
it looks like a new TODO. We are running off of a subset of master at the
moment (between 1.8.0 and 1.8.1, with some changes from after 1.8.1), so I'm
guessing that's why.
> Tasks rejected by workers get stuck in QUEUED
> ---------------------------------------------
>
> Key: AIRFLOW-1143
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1143
> Project: Apache Airflow
> Issue Type: Bug
> Components: scheduler
> Reporter: Dan Davydov
> Assignee: Gerard Toonstra
>
> If the scheduler schedules a task that is sent to a worker that then rejects
> the task (e.g. because one of the task's dependencies became bad, like
> the pool becoming full), the task will be stuck in the QUEUED state. We hit
> this trying to switch from invoking the scheduler as "airflow scheduler -n 5"
> to just "airflow scheduler".
> Restarting the scheduler fixes this because it cleans up orphans, but we
> shouldn't have to restart the scheduler to fix these problems (the missing
> job heartbeats should make the scheduler requeue the task).
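
The requeue-on-missing-heartbeat behavior described in the issue could look
roughly like the sketch below. It is only an illustration under assumptions:
QueuedTask, reset_stale_queued_tasks and the timeout parameter are
hypothetical names, not the scheduler's actual orphan cleanup.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class QueuedTask:
    # Hypothetical record; real task instances carry far more fields.
    task_id: str
    state: str
    queued_at: datetime
    last_heartbeat: Optional[datetime] = None  # None: no job ever reported in


def reset_stale_queued_tasks(
    tasks: List[QueuedTask], now: datetime, timeout: timedelta
) -> List[str]:
    """Clear the state of QUEUED tasks whose job has gone silent.

    Returns the ids of the tasks that were reset so the caller can
    schedule them again.
    """
    reset_ids = []
    for task in tasks:
        if task.state != "queued":
            continue
        # Measure silence from the last heartbeat, or from queue time
        # if no job ever heartbeated for this task.
        last_seen = task.last_heartbeat or task.queued_at
        if now - last_seen > timeout:
            task.state = "none"
            reset_ids.append(task.task_id)
    return reset_ids
```

Run on every scheduler loop, a check like this would recover rejected tasks
without a restart.
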