[ 
https://issues.apache.org/jira/browse/AIRFLOW-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997628#comment-16997628
 ] 

Björn Pollex edited comment on AIRFLOW-5071 at 12/16/19 8:47 PM:
-----------------------------------------------------------------

I have also run into this issue now, and have spent some time investigating. I 
think there may be a race condition in the scheduler when a task reschedules 
very quickly, immediately after it has started running (I've been looking at 
1.10.5). In the main scheduler loop in 
[{{_execute_helper}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L1357],
 
[{{self._execute_task_instances(...)}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L1430-L1431]
 will enqueue tasks in state {{QUEUED}} to the executor. Next, 
[{{executor.heartbeat()}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L1439]
 will actually send the tasks to the queue, where they can get picked up by a 
worker. Now, the task starts running, and immediately returns with state 
{{UP_FOR_RESCHEDULE}} (this is handled by the task instance itself on the 
worker in 
[{{_handle_reschedule()}}|https://github.com/apache/airflow/blob/1.10.5/airflow/models/taskinstance.py#L1024-L1047]).
 On the scheduler node, the {{DagFileProcessor}} will find tasks that are 
[{{UP_FOR_RESCHEDULE}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L758-L760],
 and [set them to 
{{SCHEDULED}}|https://github.com/apache/airflow/blob/1.10.5/airflow/jobs/scheduler_job.py#L1593].
 When a task reschedules, Celery reports the task as successful. The next time 
the scheduler loop comes around, the {{SCHEDULED}} task gets {{QUEUED}} again, 
but Celery still holds the information that the previous run was successful, 
and thus the inconsistency occurs.

I'm not sure if this is actually what's happening; it would probably require 
some more investigation, but maybe this can help someone who knows this code 
better than I do. Does this make any sense?
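
For anyone trying to reproduce this, here is a minimal sketch of a DAG that 
reschedules immediately after starting (all names here are hypothetical, and 
using a 1-second {{poke_interval}} to widen the race window is an assumption 
on my part):

```python
# Hypothetical reproduction sketch for Airflow 1.10.x: a sensor in
# "reschedule" mode whose poke() fails immediately, so the task instance
# goes to UP_FOR_RESCHEDULE right after it starts running.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.base_sensor_operator import BaseSensorOperator


class AlwaysRescheduleSensor(BaseSensorOperator):
    def poke(self, context):
        # Never succeeds, so every run ends in _handle_reschedule().
        return False


with DAG(
    dag_id="reschedule_race_repro",  # hypothetical name
    start_date=datetime(2019, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
) as dag:
    AlwaysRescheduleSensor(
        task_id="fast_reschedule",
        mode="reschedule",   # return to the scheduler instead of sleeping
        poke_interval=1,     # reschedule almost immediately
        timeout=300,
    )
```

If the hypothesis is right, running enough instances of this against a 
Celery/Redis setup should produce the "Was the task killed externally?" 
errors described above.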


> Thousands of Executor reports task instance X finished (success) although the 
> task says its queued. Was the task killed externally?
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5071
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5071
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: DAG, scheduler
>    Affects Versions: 1.10.3
>            Reporter: msempere
>            Priority: Critical
>
> I'm opening this issue because since I updated to 1.10.3 I'm seeing thousands 
> of daily messages like the following in the logs:
>  
> ```
>  {{__init__.py:1580}} ERROR - Executor reports task instance <TaskInstance: X 
> 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says 
> its queued. Was the task killed externally?
> {{jobs.py:1484}} ERROR - Executor reports task instance <TaskInstance: X 
> 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says 
> its queued. Was the task killed externally?
> ```
> -And it looks like this is also triggering thousands of daily emails, because 
> the flag to send email in case of failure is set to True.-
> I have Airflow setup to use Celery and Redis as a backend queue service.


