[
https://issues.apache.org/jira/browse/AIRFLOW-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997628#comment-16997628
]
Björn Pollex commented on AIRFLOW-5071:
---------------------------------------
I have also run into this issue now and have spent some time investigating. I
think there may be a race condition in the scheduler when a task reschedules
very quickly, immediately after it has started running (I've been looking at
1.10.5). In the main scheduler loop in
[{{_execute_helper}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L1357],
[{{self._execute_task_instances(...)}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L1430-L1431]
picks up tasks in state {{SCHEDULED}}, sets them to {{QUEUED}}, and enqueues
them to the executor. Next,
[{{executor.heartbeat()}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L1439]
will actually send the tasks to the queue, where they can get picked up by a
worker. Now, the task starts running, and immediately returns with state
{{UP_FOR_RESCHEDULE}} (this is handled by the task instance itself on the
worker in
[{{_handle_reschedule()}}|https://github.com/apache/airflow/blob/1.10.5/airflow/models/taskinstance.py#L1024-L1047]).
On the scheduler node, the {{DagFileProcessor}} will find tasks that are
[{{UP_FOR_RESCHEDULE}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L758-L760],
and [set them back to
{{SCHEDULED}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L1593].
When a task reschedules, Celery reports the task as successful. The next time
the scheduler loop comes around, the {{SCHEDULED}} task gets {{QUEUED}} again,
but Celery still holds the information that the earlier run was successful, and
thus the inconsistency occurs.
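For context, the error message quoted in this issue comes from the scheduler's
handling of executor events, where the executor's terminal state is compared
against the task instance's current state. Roughly like this (a paraphrased
sketch of {{SchedulerJob._process_executor_events}} in 1.10.5, not the verbatim
code; {{find_task_instance}} is a made-up helper standing in for the real
session query):
{code:python}
# Paraphrased sketch (not verbatim) of the consistency check that emits the
# error from this issue. find_task_instance is a hypothetical stand-in for
# the real session query against the task_instance table.
from airflow.utils.state import State


def process_executor_events(executor, find_task_instance, log):
    for key, executor_state in list(executor.get_event_buffer().items()):
        ti = find_task_instance(key)  # load the TaskInstance for this key
        if executor_state == State.SUCCESS and ti.state == State.QUEUED:
            # Celery reports the earlier (rescheduled) run as SUCCESS, but the
            # TI has meanwhile been set back to QUEUED for the next poke.
            log.error(
                "Executor reports task instance %s finished (%s) although the "
                "task says its %s. Was the task killed externally?",
                ti, executor_state, ti.state,
            )
{code}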
I'm not sure if this is actually what's happening; it would probably require
some more investigation. But maybe this can help someone who knows this code
better than I do. Does this make any sense?
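In case someone wants to try reproducing this: a minimal, hypothetical DAG
along these lines should exercise exactly this path, since the sensor finishes
within milliseconds of starting and goes straight to {{UP_FOR_RESCHEDULE}} (all
names below are made up):
{code:python}
# Minimal, hypothetical reproducer sketch: a sensor in "reschedule" mode whose
# poke() is never satisfied, so each run goes
# QUEUED -> RUNNING -> UP_FOR_RESCHEDULE almost immediately.
from datetime import datetime

from airflow import DAG
from airflow.sensors.base_sensor_operator import BaseSensorOperator


class AlwaysRescheduleSensor(BaseSensorOperator):
    def poke(self, context):
        return False  # never satisfied: the task reschedules on every poke


with DAG(
    dag_id="reschedule_race_repro",
    start_date=datetime(2019, 7, 29),
    schedule_interval="@daily",
) as dag:
    AlwaysRescheduleSensor(
        task_id="fast_reschedule",
        mode="reschedule",   # free the worker slot between pokes
        poke_interval=5,     # short interval => frequent re-queueing
        timeout=3600,
    )
{code}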
> Thousands of Executor reports task instance X finished (success) although the
> task says its queued. Was the task killed externally?
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: AIRFLOW-5071
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5071
> Project: Apache Airflow
> Issue Type: Bug
> Components: DAG, scheduler
> Affects Versions: 1.10.3
> Reporter: msempere
> Priority: Critical
>
> I'm opening this issue because since I updated to 1.10.3 I've been seeing
> thousands of daily messages like the following in the logs:
>
> ```
> {{__init__.py:1580}} ERROR - Executor reports task instance <TaskInstance: X
> 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says
> its queued. Was the task killed externally?
> {{jobs.py:1484}} ERROR - Executor reports task instance <TaskInstance: X
> 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says
> its queued. Was the task killed externally?
> ```
> -And it looks like this is also triggering thousands of daily emails because
> the flag to send email in case of failure is set to True.-
> I have Airflow setup to use Celery and Redis as a backend queue service.