shopee-jin opened a new issue #15527:
URL: https://github.com/apache/airflow/issues/15527
**Apache Airflow version**:
1.10.9
**Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
Not applicable (not running in containers)
**Environment**:
Ubuntu 18.04.5 LTS
Redis, CeleryExecutor
**What happened**:
We have more than 10k sensor tasks in our Airflow cluster, and we run them in
`reschedule` mode to save resources.
In extremely rare cases (**an estimated 1 in a million**), a sensor
task can get stuck forever.
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/jobs/scheduler_job.py#L1316
The scheduler received an executor failure event:
`2021-04-25 11:02:46,402 INFO - Executor reports execution of
dag_vision_base_MY.xxxx execution_date=2021-04-24 06:00:00+08:00 exited with
status failed for try_number 1
`
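To find other occurrences of this event in scheduler logs, the message above can be matched with a regular expression. The following is only a rough sketch: the pattern is modelled on the single log line quoted here, `parse_executor_event` is a hypothetical helper name, and other Airflow versions may word the message differently.

```python
import re

# Pattern modelled on the one executor event quoted in this report.
# The "dag.task" part is kept as a single field because both ids may
# themselves contain dots, so splitting it would be a guess.
EXECUTOR_EVENT = re.compile(
    r"Executor reports execution of (?P<task>\S+) "
    r"execution_date=(?P<execution_date>\S+ \S+) "
    r"exited with status (?P<status>\w+) "
    r"for try_number (?P<try_number>\d+)"
)


def parse_executor_event(text):
    """Return the fields of an executor event as a dict, or None if absent."""
    # Collapse hard line wraps so the message is matched as one line.
    flat = " ".join(text.split())
    match = EXECUTOR_EVENT.search(flat)
    if match is None:
        return None
    event = match.groupdict()
    event["try_number"] = int(event["try_number"])
    return event


if __name__ == "__main__":
    sample = """2021-04-25 11:02:46,402 INFO - Executor reports execution of
    dag_vision_base_MY.xxxx execution_date=2021-04-24 06:00:00+08:00 exited with
    status failed for try_number 1"""
    print(parse_executor_event(sample))
```

Filtering the parsed events for `status == "failed"` gives a list of task instances to compare against their state in the metadata database.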
After that, the task remained stuck forever:
`[2021-04-25 11:02:43,074] {scheduler_job.py:1604} INFO - Creating /
updating <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00
[scheduled]> in ORM
[2021-04-25 11:03:16,631] {logging_mixin.py:112} INFO - [2021-04-25
11:03:16,631] {taskinstance.py:672} DEBUG - <TaskInstance:
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Not In
Retry Period' PASSED: True, The context specified that being in a retry period
was permitted.
[2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25
11:03:16,631] {taskinstance.py:672} DEBUG - <TaskInstance:
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Trigger
Rule' PASSED: True, The task instance did not have any upstream tasks.
[2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25
11:03:16,632] {taskinstance.py:672} DEBUG - <TaskInstance:
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency
'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
[2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25
11:03:16,632] {taskinstance.py:672} DEBUG - <TaskInstance:
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Ready
To Reschedule' PASSED: True, The context specified that being in a reschedule
period was permitted.
[2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25
11:03:16,632] {taskinstance.py:655} DEBUG - Dependencies all met for
<TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]>
[2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25
11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance:
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Not In
Retry Period' PASSED: True, The context specified that being in a retry period
was permitted.
[2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25
11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance:
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Trigger
Rule' PASSED: True, The task instance did not have any upstream tasks.
[2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25
11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance:
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency
'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
[2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25
11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance:
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Ready
To Reschedule' PASSED: True, The context specified that being in a reschedule
period was permitted.
[2021-04-25 11:03:50,524] {logging_mixin.py:112} INFO - [2021-04-25
11:03:50,523] {taskinstance.py:655} DEBUG - Dependencies all met for
<TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]>`
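The log excerpt above shows the same task instance reported as `queued` on every pass. As a rough way to flag such instances from log text alone, the sketch below records how long each task instance has been seen in its current state and lists those that stay `queued` for at least a chosen threshold. The function names and the threshold are hypothetical, the pattern assumes the `<TaskInstance: ... [state]>` format shown here, and a long `queued` period in the logs is only a hint, not proof that the task is stuck.

```python
import re
from datetime import datetime, timedelta

# Leading log timestamp, then the first TaskInstance repr that follows it.
ENTRY = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\]"
    r".*?<TaskInstance: (\S+) (\S+ \S+) \[(\w+)\]>"
)


def current_state_spans(log_text):
    """Map (task, execution_date) to (state, first_seen, last_seen).

    first_seen and last_seen cover only the most recent unbroken run of
    the same state, so a state change resets the span.
    """
    flat = " ".join(log_text.split())  # undo hard line wraps
    spans = {}
    for ts, task, exec_date, state in ENTRY.findall(flat):
        seen = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        key = (task, exec_date)
        previous = spans.get(key)
        if previous is None or previous[0] != state:
            spans[key] = (state, seen, seen)
        else:
            spans[key] = (state, previous[1], seen)
    return spans


def stuck_in_queued(log_text, threshold):
    """List task instances whose latest state is 'queued' for >= threshold."""
    return sorted(
        key
        for key, (state, first, last) in current_state_spans(log_text).items()
        if state == "queued" and last - first >= threshold
    )


if __name__ == "__main__":
    sample = """[2021-04-25 11:02:43,074] {scheduler_job.py:1604} INFO - Creating /
    updating <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00
    [scheduled]> in ORM
    [2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25
    11:03:16,632] {taskinstance.py:655} DEBUG - Dependencies all met for
    <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]>
    [2021-04-25 11:03:50,524] {logging_mixin.py:112} INFO - [2021-04-25
    11:03:50,523] {taskinstance.py:655} DEBUG - Dependencies all met for
    <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]>"""
    print(stuck_in_queued(sample, timedelta(seconds=30)))
```

In practice the threshold would need to be well above the normal time between queueing and execution, otherwise healthy tasks are flagged too.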
**What you expected to happen**:
All sensor tasks should be rescheduled correctly, regardless of whether the
executor reports success or failure.
**How to reproduce it**:
[Extremely Flaky]
**Anything else we need to know**: