kk17 opened a new issue #19277:
URL: https://github.com/apache/airflow/issues/19277


   ### Apache Airflow version
   
   2.1.4
   
   ### Operating System
   
   Official Docker Image on Amazon Linux 2 host
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Docker-Compose
   
   ### Deployment details
   
   We deploy 2 schedulers, 2 workers, 2 webservers, and use MySQL 8 as the 
database backend. We have over 2000 small DAGs.
   
   ### What happened
   
   We have been using 2.1.4 for almost one month. In recent days, we 
found that some tasks failed because of the following error:
   
   ```
   worker>     [2021-10-28 02:51:29,343] {local_task_job.py:208} WARNING - 
State of this instance has been externally set to None. Terminating instance.
   worker>     [2021-10-28 02:51:29,347] {process_utils.py:100} INFO - Sending 
Signals.SIGTERM to GPID 25492
   worker>     [2021-10-28 02:51:29,347] {taskinstance.py:1236} ERROR - 
Received SIGTERM. Terminating subprocesses.
   worker>     [2021-10-28 02:51:29,348] {custom_task_operator.py:230} INFO - 
receive kill signal , kill batch result
   worker>     [2021-10-28 02:51:29,348] {custom_task_operator.py:232} INFO - 
start kill batch task :434649
   ```
   
   This kind of error only happens occasionally, once or twice a day, for some 
execution_date of some tasks in a few different DAGs. I checked all the 
scheduler and worker logs related to one failed task. Below are the logs, 
combined and sorted by time.
   
   ```
   scheduler2> [2021-10-28 02:51:23,676] {base_executor.py:82} INFO - Adding to 
queue: ['airflow', 'tasks', 'run', 'schedule_12238', 'task_schedule_12238', 
'2021-10-28T02:35:00+00:00', '--local', '--pool', 'livy_pool', '--subdir', 
'/usr/local/airflow/dags/schedule_12238.py']
   scheduler2> [2021-10-28 02:51:25,450] {scheduler_job.py:612} INFO - Executor 
reports execution of schedule_12238.task_schedule_12238 
execution_date=2021-10-28 02:35:00+00:00 exited with status queued for 
try_number 1
   scheduler2> [2021-10-28 02:51:25,483] {scheduler_job.py:639} INFO - Setting 
external_id for <TaskInstance: schedule_12238.task_schedule_12238 2021-10-28 
02:35:00+00:00 [None]> to 28123ae7-5f11-4cca-8db7-8509d7794299
   
   worker>     [2021-10-28 02:51:29,343] {local_task_job.py:208} WARNING - 
State of this instance has been externally set to None. Terminating instance.
   worker>     [2021-10-28 02:51:29,347] {process_utils.py:100} INFO - Sending 
Signals.SIGTERM to GPID 25492
   worker>     [2021-10-28 02:51:29,347] {taskinstance.py:1236} ERROR - 
Received SIGTERM. Terminating subprocesses.
   worker>     [2021-10-28 02:51:29,348] {custom_task_operator.py:230} INFO - 
receive kill signal , kill batch result
   worker>     [2021-10-28 02:51:29,348] {custom_task_operator.py:232} INFO - 
start kill batch task :434649
   
   scheduler2> [2021-10-28 02:51:52,225] {scheduler_job.py:612} INFO - Executor 
reports execution of schedule_12238.task_schedule_12238 
execution_date=2021-10-28 02:35:00+00:00 exited with status success for 
try_number 1
   
   scheduler1> [2021-10-28 02:53:01,244] {scheduler_job.py:411} INFO - DAG 
schedule_12238 has 0/10 running and queued tasks
   scheduler1> [2021-10-28 02:53:01,255] {scheduler_job.py:518} INFO - Sending 
TaskInstanceKey(dag_id='schedule_12238', task_id='task_schedule_12238', 
execution_date=datetime.datetime(2021, 10, 28, 2, 35, tzinfo=Timezone('UTC')), 
try_number=2) to executor with priority 1 and queue default
   scheduler1> [2021-10-28 02:53:01,255] {base_executor.py:82} INFO - Adding to 
queue: ['airflow', 'tasks', 'run', 'schedule_12238', 'task_schedule_12238', 
'2021-10-28T02:35:00+00:00', '--local', '--pool', 'livy_pool', '--subdir', 
'/usr/local/airflow/dags/schedule_12238.py']
   scheduler1> [2021-10-28 02:53:02,272] {scheduler_job.py:612} INFO - Executor 
reports execution of schedule_12238.task_schedule_12238 
execution_date=2021-10-28 02:35:00+00:00 exited with status queued for 
try_number 2
   scheduler1> [2021-10-28 02:53:02,295] {scheduler_job.py:639} INFO - Setting 
external_id for <TaskInstance: schedule_12238.task_schedule_12238 2021-10-28 
02:35:00+00:00 [running]> to d09a34e8-627a-4b47-9cbc-b1bbdbdc1023
   scheduler1> [2021-10-28 02:54:53,773] {scheduler_job.py:612} INFO - Executor 
reports execution of schedule_12238.task_schedule_12238 
execution_date=2021-10-28 02:35:00+00:00 exited with status success for 
try_number 2
   
   scheduler2> [2021-10-28 02:54:53,795] {dagrun.py:477} INFO - Marking run 
<DagRun schedule_12238 @ 2021-10-28 02:35:00+00:00: 
scheduled__2021-10-28T02:35:00+00:00, externally triggered: False> successful
   ```
   
   We can see that right after scheduler 2 logged this line:
   ```
   scheduler2> [2021-10-28 02:51:25,483] {scheduler_job.py:639} INFO - Setting 
external_id for <TaskInstance: schedule_12238.task_schedule_12238 2021-10-28 
02:35:00+00:00 [None]> to 28123ae7-5f11-4cca-8db7-8509d7794299
   ```
   the worker logged `State of this instance has been externally set to None`. 
So I suspect that scheduler 2 wrongly set the task instance's state to None. 
I checked the related code in scheduler_job.py, but due to my limited 
understanding of the Airflow source code, I could not find the exact cause. Can 
anyone confirm whether this is an Airflow bug?
   
   
   
   ### What you expected to happen
   
   The `State of this instance has been externally set to None` error should not happen.
   
   ### How to reproduce
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

