shopee-jin opened a new issue #15527:
URL: https://github.com/apache/airflow/issues/15527


   **Apache Airflow version**:
   1.10.9
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   NOT running containers
   
   **Environment**:
   Ubuntu 18.04.5 LTS
   Redis, CeleryExecutor
   
   **What happened**:
   We have >10k sensor tasks in the Airflow cluster, and we use `reschedule` 
mode to save resources.
   In extremely rare cases (**an estimated 1 in a million**), a sensor 
task can get stuck forever.
    
   
https://github.com/apache/airflow/blob/4aec433e48dcc66c9c7b74947c499260ab6be9e9/airflow/jobs/scheduler_job.py#L1316
   
   The scheduler received an executor failure event:
   
   `2021-04-25 11:02:46,402 INFO - Executor reports execution of 
dag_vision_base_MY.xxxx execution_date=2021-04-24 06:00:00+08:00 exited with 
status failed for try_number 1
   `
   
   After that, the task was stuck forever.
   
   `[2021-04-25 11:02:43,074] {scheduler_job.py:1604} INFO - Creating / 
updating <TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 
[scheduled]> in ORM
   [2021-04-25 11:03:16,631] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:16,631] {taskinstance.py:672} DEBUG - <TaskInstance: 
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Not In 
Retry Period' PASSED: True, The context specified that being in a retry period 
was permitted.
   [2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:16,631] {taskinstance.py:672} DEBUG - <TaskInstance: 
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Trigger 
Rule' PASSED: True, The task instance did not have any upstream tasks.
   [2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:16,632] {taskinstance.py:672} DEBUG - <TaskInstance: 
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 
'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
   [2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:16,632] {taskinstance.py:672} DEBUG - <TaskInstance: 
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Ready 
To Reschedule' PASSED: True, The context specified that being in a reschedule 
period was permitted.
   [2021-04-25 11:03:16,632] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:16,632] {taskinstance.py:655} DEBUG - Dependencies all met for 
<TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]>
   [2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance: 
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Not In 
Retry Period' PASSED: True, The context specified that being in a retry period 
was permitted.
   [2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance: 
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Trigger 
Rule' PASSED: True, The task instance did not have any upstream tasks.
   [2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance: 
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 
'Previous Dagrun State' PASSED: True, The task did not have depends_on_past set.
   [2021-04-25 11:03:50,523] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:50,523] {taskinstance.py:672} DEBUG - <TaskInstance: 
dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]> dependency 'Ready 
To Reschedule' PASSED: True, The context specified that being in a reschedule 
period was permitted.
   [2021-04-25 11:03:50,524] {logging_mixin.py:112} INFO - [2021-04-25 
11:03:50,523] {taskinstance.py:655} DEBUG - Dependencies all met for 
<TaskInstance: dag_vision_base_MY.xxxx 2021-04-24 06:00:00+08:00 [queued]>`
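
To spot task instances in this state, a rough sketch like the one below could scan scheduler log lines for task instances that keep being reported as `queued`. It assumes each log record is on a single line in the format shown above. The function name `find_stuck_queued` and the 30-second threshold are hypothetical and not part of Airflow.

```python
import re
from datetime import datetime, timedelta

# Leading timestamp of a log line, e.g. "[2021-04-25 11:03:16,631]".
TS_RE = re.compile(r"^\[?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]?")
# e.g. "<TaskInstance: dag.task 2021-04-24 06:00:00+08:00 [queued]>".
TI_RE = re.compile(r"<TaskInstance: (\S+) (\S+ \S+) \[(\w+)\]>")


def find_stuck_queued(lines, threshold=timedelta(seconds=30)):
    """Map (task, execution_date) to how long it was logged as 'queued'.

    Only task instances whose first and last 'queued' log lines are at
    least `threshold` apart are returned. Hypothetical helper.
    """
    first_seen = {}
    last_seen = {}
    for line in lines:
        ts_match = TS_RE.match(line.strip())
        ti_match = TI_RE.search(line)
        if not ts_match or not ti_match:
            continue
        ts = datetime.strptime(ts_match.group(1), "%Y-%m-%d %H:%M:%S")
        key = (ti_match.group(1), ti_match.group(2))
        if ti_match.group(3) == "queued":
            first_seen.setdefault(key, ts)
            last_seen[key] = ts
        else:
            # Any other state means the task moved on, so start over.
            first_seen.pop(key, None)
            last_seen.pop(key, None)
    return {
        key: last_seen[key] - first_seen[key]
        for key in first_seen
        if last_seen[key] - first_seen[key] >= threshold
    }
```

This only reads logs. It does not explain why the task stays queued, and a long `queued` period can also be legitimate on a busy cluster.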
   
   **What you expected to happen**:
   All sensor tasks should be rescheduled correctly, regardless of whether they 
returned success or failure.
   
   <!-- What do you think went wrong? -->
   
   
   **How to reproduce it**:
   [Extremely Flaky]
   
   **Anything else we need to know**:
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

