kk17 opened a new issue #19277:
URL: https://github.com/apache/airflow/issues/19277
### Apache Airflow version
2.1.4
### Operating System
Official Docker Image on Amazon Linux 2 Host
### Versions of Apache Airflow Providers
_No response_
### Deployment
Docker-Compose
### Deployment details
We deploy 2 schedulers, 2 workers, 2 webservers, and use MySQL8 as the
Database backend. We have over 2000 small DAGs.
### What happened
We have been using 2.1.4 for almost one month. In recent days, we
found that some tasks failed because of the following error:
```
worker> [2021-10-28 02:51:29,343] {local_task_job.py:208} WARNING -
State of this instance has been externally set to None. Terminating instance.
worker> [2021-10-28 02:51:29,347] {process_utils.py:100} INFO - Sending
Signals.SIGTERM to GPID 25492
worker> [2021-10-28 02:51:29,347] {taskinstance.py:1236} ERROR -
Received SIGTERM. Terminating subprocesses.
worker> [2021-10-28 02:51:29,348] {custom_task_operator.py:230} INFO -
receive kill signal , kill batch result
worker> [2021-10-28 02:51:29,348] {custom_task_operator.py:232} INFO -
start kill batch task :434649
```
This kind of error only happens occasionally, once or twice a day, for some
execution dates of some tasks in different DAGs. I checked all the
scheduler and worker logs related to one failed task. Below are the logs combined
and sorted by time.
```
scheduler2> [2021-10-28 02:51:23,676] {base_executor.py:82} INFO - Adding to
queue: ['airflow', 'tasks', 'run', 'schedule_12238', 'task_schedule_12238',
'2021-10-28T02:35:00+00:00', '--local', '--pool', 'livy_pool', '--subdir',
'/usr/local/airflow/dags/schedule_12238.py']
scheduler2> [2021-10-28 02:51:25,450] {scheduler_job.py:612} INFO - Executor
reports execution of schedule_12238.task_schedule_12238
execution_date=2021-10-28 02:35:00+00:00 exited with status queued for
try_number 1
scheduler2> [2021-10-28 02:51:25,483] {scheduler_job.py:639} INFO - Setting
external_id for <TaskInstance: schedule_12238.task_schedule_12238 2021-10-28
02:35:00+00:00 [None]> to 28123ae7-5f11-4cca-8db7-8509d7794299
worker> [2021-10-28 02:51:29,343] {local_task_job.py:208} WARNING -
State of this instance has been externally set to None. Terminating instance.
worker> [2021-10-28 02:51:29,347] {process_utils.py:100} INFO - Sending
Signals.SIGTERM to GPID 25492
worker> [2021-10-28 02:51:29,347] {taskinstance.py:1236} ERROR -
Received SIGTERM. Terminating subprocesses.
worker> [2021-10-28 02:51:29,348] {custom_task_operator.py:230} INFO -
receive kill signal , kill batch result
worker> [2021-10-28 02:51:29,348] {custom_task_operator.py:232} INFO -
start kill batch task :434649
scheduler2> [2021-10-28 02:51:52,225] {scheduler_job.py:612} INFO - Executor
reports execution of schedule_12238.task_schedule_12238
execution_date=2021-10-28 02:35:00+00:00 exited with status success for
try_number 1
scheduler1> [2021-10-28 02:53:01,244] {scheduler_job.py:411} INFO - DAG
schedule_12238 has 0/10 running and queued tasks
scheduler1> [2021-10-28 02:53:01,255] {scheduler_job.py:518} INFO - Sending
TaskInstanceKey(dag_id='schedule_12238', task_id='task_schedule_12238',
execution_date=datetime.datetime(2021, 10, 28, 2, 35, tzinfo=Timezone('UTC')),
try_number=2) to executor with priority 1 and queue default
scheduler1> [2021-10-28 02:53:01,255] {base_executor.py:82} INFO - Adding to
queue: ['airflow', 'tasks', 'run', 'schedule_12238', 'task_schedule_12238',
'2021-10-28T02:35:00+00:00', '--local', '--pool', 'livy_pool', '--subdir',
'/usr/local/airflow/dags/schedule_12238.py']
scheduler1> [2021-10-28 02:53:02,272] {scheduler_job.py:612} INFO - Executor
reports execution of schedule_12238.task_schedule_12238
execution_date=2021-10-28 02:35:00+00:00 exited with status queued for
try_number 2
scheduler1> [2021-10-28 02:53:02,295] {scheduler_job.py:639} INFO - Setting
external_id for <TaskInstance: schedule_12238.task_schedule_12238 2021-10-28
02:35:00+00:00 [running]> to d09a34e8-627a-4b47-9cbc-b1bbdbdc1023
scheduler1> [2021-10-28 02:54:53,773] {scheduler_job.py:612} INFO - Executor
reports execution of schedule_12238.task_schedule_12238
execution_date=2021-10-28 02:35:00+00:00 exited with status success for
try_number 2
scheduler2> [2021-10-28 02:54:53,795] {dagrun.py:477} INFO - Marking run
<DagRun schedule_12238 @ 2021-10-28 02:35:00+00:00:
scheduled__2021-10-28T02:35:00+00:00, externally triggered: False> successful
```
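For reference, the combined view above was produced by merging the per-process log files and sorting by timestamp. A small sketch of that merge (the `scheduler1>`/`worker>` source prefixes are added by the script, not by Airflow):

```python
# Sketch: merge per-process Airflow log lines into one stream sorted by
# timestamp. The "source> " prefix matches the combined view above and is
# added here; it is not produced by Airflow itself.
import re

def parse_ts(line):
    """Extract the '[YYYY-MM-DD HH:MM:SS,mmm]' timestamp, or None."""
    m = re.search(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]", line)
    return m.group(1) if m else None

def merge_logs(sources):
    """sources: {'scheduler1': [line, ...], ...} -> lines sorted by time."""
    tagged = []
    for name, lines in sources.items():
        for line in lines:
            ts = parse_ts(line)
            if ts:  # keep only timestamped lines
                tagged.append((ts, f"{name}> {line}"))
    return [line for _, line in sorted(tagged)]
```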
We can see that after scheduler 2 output this line:
```
scheduler2> [2021-10-28 02:51:25,483] {scheduler_job.py:639} INFO - Setting
external_id for <TaskInstance: schedule_12238.task_schedule_12238 2021-10-28
02:35:00+00:00 [None]> to 28123ae7-5f11-4cca-8db7-8509d7794299
```
The worker output `State of this instance has been externally set to None`.
So I suspect that scheduler 2 wrongly set the task instance's state to None.
I checked the related code in scheduler_job.py, but due to my limited
understanding of the Airflow source code, I could not find the exact cause.
Can anyone confirm whether this is an Airflow bug?
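For illustration only, here is a toy model of the race I suspect. This is not Airflow's actual code; the class and function names are my own assumptions. It only shows how a worker heartbeat that compares its expected state against the database would kill a running task if another process resets the DB state to None:

```python
# Toy model of the suspected race (NOT Airflow's real implementation).
class FakeDB:
    """Stands in for the metadata DB row of one task instance."""
    def __init__(self):
        self.ti_state = "running"

class LocalTaskHeartbeat:
    """Worker-side heartbeat that trusts the DB as the source of truth."""
    def __init__(self, db):
        self.db = db
        self.terminated = False

    def heartbeat_callback(self):
        # Mirrors the warning seen in the worker log: any state other than
        # "running" is treated as an external change and ends the task.
        if self.db.ti_state != "running":
            print(f"State of this instance has been externally set to "
                  f"{self.db.ti_state}. Terminating instance.")
            self.terminated = True

def scheduler_resets_state(db):
    # The suspected buggy actor: a second scheduler clearing the state.
    db.ti_state = None

db = FakeDB()
hb = LocalTaskHeartbeat(db)
hb.heartbeat_callback()      # state is still "running": nothing happens
scheduler_resets_state(db)   # second scheduler resets the state to None
hb.heartbeat_callback()      # heartbeat now terminates the task
```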
### What you expected to happen
The task should run to completion without the `State of this instance has
been externally set to None` error.
### How to reproduce
_No response_
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)