[ https://issues.apache.org/jira/browse/AIRFLOW-5071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520659#comment-17520659 ]

ASF GitHub Bot commented on AIRFLOW-5071:
-----------------------------------------

woodywuuu commented on issue #10790:
URL: https://github.com/apache/airflow/issues/10790#issuecomment-1095231809

   airflow: 2.2.2 with mysql8, HA scheduler, celery executor (redis backend)
   
   From the logs, we can see that the tis which reported the error `killed externally 
(status: success)` had been rescheduled! 
   1. scheduler found a ti to schedule (ti state: None -> scheduled)
   2. scheduler queued the ti (ti state: scheduled -> queued)
   3. scheduler sent the ti to celery
   4. worker picked up the ti
   5. worker found the ti's state in mysql was still scheduled 
https://github.com/apache/airflow/blob/2.2.2/airflow/models/taskinstance.py#L1224
   6. worker set this ti back to None (see the sketch after this list)
   7. scheduler rescheduled this ti
   8. scheduler could not queue this ti again, and found this ti reported success (in 
celery), so set it to failed
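
   For reference, this is roughly what happens at the line linked in step 5. It is a 
simplified sketch of `TaskInstance.check_and_change_state_before_execution` in 2.2.2 
as I read it (the real method goes through a dep check and handles more cases), just 
to show the state reset in steps 5-6:

   ```python
   from airflow.utils.state import State

   def check_state_before_execution(ti, session):
       # Re-read the row with a row lock so we see the state currently in the DB.
       ti.refresh_from_db(session=session, lock_for_update=True)

       # In 2.2.2 this is done through a dep that requires the ti to be QUEUED.
       # If the worker still sees "scheduled" (step 5 above) ...
       if ti.state != State.QUEUED:
           # ... the ti is handed back to the scheduler by clearing its state
           # (step 6 above), which is why the scheduler reschedules it later.
           ti.state = State.NONE
           session.merge(ti)
           session.commit()
           return False

       # Otherwise the worker claims the ti and starts running it.
       ti.state = State.RUNNING
       session.merge(ti)
       session.commit()
       return True
   ```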
   
   From mysql we can see that all of the failed tasks have no external_executor_id!
   
   We ran 5000 dags, each with 50 dummy tasks, and found that if the following two 
conditions are met, the probability of triggering this problem increases sharply:
   
   1. no external_executor_id was set on some queued tis in celery 
https://github.com/apache/airflow/blob/2.2.2/airflow/jobs/scheduler_job.py#L537
      * The sql above uses skip_locked, so some tis already queued in celery may miss 
getting this external_executor_id (see the sketch after this list). 
   2. a scheduler loop took very long (more than 60s), so 
`adopt_or_reset_orphaned_tasks` judged that the SchedulerJob had failed and tried to 
adopt the orphaned tis 
https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L442
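
   To make condition 1 concrete, here is a minimal SQLAlchemy sketch of what a 
skip_locked query like the one linked above means for setting external_executor_id. 
It is an illustration only, not the actual scheduler_job.py code, and 
`record_celery_id` is a made-up helper standing in for however the scheduler looks up 
the celery task id:

   ```python
   from airflow.models import TaskInstance
   from airflow.utils.state import State

   def assign_external_executor_ids(session, record_celery_id):
       # SELECT ... FOR UPDATE SKIP LOCKED: rows currently locked by another
       # transaction (the other HA scheduler, or a worker holding the row via
       # refresh_from_db(lock_for_update=True)) are silently skipped ...
       tis = (
           session.query(TaskInstance)
           .filter(
               TaskInstance.state == State.QUEUED,
               TaskInstance.external_executor_id.is_(None),
           )
           .with_for_update(skip_locked=True)
           .all()
       )
       for ti in tis:
           # ... so a ti that is already queued in celery but whose row happened
           # to be locked at this moment never gets its external_executor_id set.
           ti.external_executor_id = record_celery_id(ti)
       session.commit()
   ```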
   
   We ran these tests:
   1. patched `SchedulerJob._process_executor_events` so that it does not set 
external_executor_id on those queued tis
      * 300+ dags failed with `killed externally (status: success)`, versus normally 
fewer than 10
   2. patched `adopt_or_reset_orphaned_tasks` so that it does not adopt orphaned tis 
(see the sketch after this list)
      * no dag failed!
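
   Our exact patches are not shown here, but the second test can be approximated in a 
single-process run with something like the sketch below (the no-op return value of 0 
is an assumption, standing in for "no orphaned tis were reset"):

   ```python
   from unittest import mock

   from airflow.jobs.scheduler_job import SchedulerJob

   # Replace adopt_or_reset_orphaned_tasks with a no-op so the scheduler
   # never tries to adopt/reset "orphaned" tis during long loops.
   with mock.patch.object(
       SchedulerJob, "adopt_or_reset_orphaned_tasks", return_value=0
   ):
       SchedulerJob().run()  # run the scheduler with adoption disabled
   ```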
   
   I read the notes linked 
[here](https://github.com/apache/airflow/blob/9ac742885ffb83c15f7e3dc910b0cf9df073407a/airflow/executors/celery_executor.py#L442), 
but still don't understand this problem:
   1. why should we handle queued tis in celery and set this external_executor_id?




> Thousands of Executor reports task instance X finished (success) although the 
> task says its queued. Was the task killed externally?
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5071
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5071
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: DAG, scheduler
>    Affects Versions: 1.10.3
>            Reporter: msempere
>            Priority: Critical
>             Fix For: 1.10.12
>
>         Attachments: image-2020-01-27-18-10-29-124.png, 
> image-2020-07-08-07-58-42-972.png
>
>
> I'm opening this issue because since I updated to 1.10.3 I'm seeing thousands 
> of daily messages like the following in the logs:
>  
> ```
>  {{__init__.py:1580}} ERROR - Executor reports task instance <TaskInstance: X 
> 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says 
> its queued. Was the task killed externally?
> {{jobs.py:1484}} ERROR - Executor reports task instance <TaskInstance: X 
> 2019-07-29 00:00:00+00:00 [queued]> finished (success) although the task says 
> its queued. Was the task killed externally?
> ```
> -And it looks like this is also triggering thousands of daily emails because the 
> flag to send email in case of failure is set to True.-
> I have Airflow setup to use Celery and Redis as a backend queue service.


