[ 
https://issues.apache.org/jira/browse/AIRFLOW-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaxil Naik updated AIRFLOW-4485:
--------------------------------
    Fix Version/s: 1.10.4

> All tasks stop running when using reschedule mode due to some tasks having 
> a negative try_number
> ------------------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4485
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4485
>             Project: Apache Airflow
>          Issue Type: Bug
>    Affects Versions: 1.10.3
>            Reporter: Teresa Martyny
>            Priority: Major
>             Fix For: 1.10.4
>
>
> When we use reschedule mode for our sensors, about an hour into a run of 
> our core pipeline, the following happens:
> 1. Negative try_number: On the Scheduler we begin to see `Executor reports 
> execution of [task info here] exited with status success for try_number -1`. 
> The try_number then keeps decrementing until it reaches -4. On every run, 
> -4 is the value at which the following steps play out:
> 2. On the Scheduler, this error message spikes and then stops: 
> `ERROR - Executor reports task instance {} finished ({}) although the task 
> says its {}. Was the task killed externally?` coming from 
> `airflow/jobs.py#_process_executor_events`
> 3. This is sometimes followed by a few instances of this error on a single 
> Worker: `Celery command failed`, coming from 
> `airflow/executors/celery_executor.py#execute_command`.
> 4. This is followed on the Worker by one instance of `ZeroDivisionError`.
> 5. This is followed by a spike in `ZeroDivisionError` on the Scheduler, 
> originating from `airflow/models/__init__.py#next_retry_datetime`, line 1183.
> 6. The pipeline then grinds to a halt. Tasks sit in the scheduled state in 
> the Scheduler and Celery does not pick them up. If try_numbers go negative 
> but never reach -4, the pipeline does not grind to a halt.
>  
> We identified that reschedule mode decrements the try_number in 
> `airflow/models/__init__.py#_handle_reschedule`. 
> We did not identify why it never increments the `try_number` again, which 
> is ostensibly what the code is attempting: to reuse the same `try_number` 
> and write to the same log file.
> When we switched the sensors to poke mode instead, all of the above 
> problems stopped. 
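
The report does not say why -4 in particular triggers the `ZeroDivisionError`. The sketch below shows one way a negative `try_number` could produce that error inside a retry backoff computation. It is an illustration under stated assumptions, not the code from `next_retry_datetime`: the formula `retry_delay * 2 ** (try_number - 2)`, the use of the result as a modulus for jitter, the 60-second `retry_delay`, and the names `min_backoff_seconds`, `jittered_backoff` and `task_hash` are all assumptions made here.

```python
from datetime import timedelta


def min_backoff_seconds(retry_delay: timedelta, try_number: int) -> int:
    """Assumed backoff formula: delay * 2 ** (try_number - 2), truncated to int."""
    return int(retry_delay.total_seconds() * (2 ** (try_number - 2)))


def jittered_backoff(retry_delay: timedelta, try_number: int, task_hash: int) -> int:
    """Add deterministic jitter to the backoff.

    The modulo raises ZeroDivisionError once the truncated backoff is 0.
    """
    min_backoff = min_backoff_seconds(retry_delay, try_number)
    return min_backoff + task_hash % min_backoff


if __name__ == "__main__":
    delay = timedelta(seconds=60)  # hypothetical retry_delay, not from the report
    for try_number in (1, -1, -2, -3, -4):
        try:
            print(try_number, jittered_backoff(delay, try_number, task_hash=12345))
        except ZeroDivisionError:
            print(try_number, "ZeroDivisionError")
```

Under these assumptions the truncated backoff is still 1 second at `try_number` -3 and becomes 0 at -4, so the modulo only fails once -4 is reached. A different `retry_delay` would move that threshold.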



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
