[
https://issues.apache.org/jira/browse/AIRFLOW-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kaxil Naik updated AIRFLOW-4485:
--------------------------------
Fix Version/s: 1.10.4
> All tasks stop running when using reschedule mode due to some tasks having
> a negative try_number
> ------------------------------------------------------------------------------------------------
>
> Key: AIRFLOW-4485
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4485
> Project: Apache Airflow
> Issue Type: Bug
> Affects Versions: 1.10.3
> Reporter: Teresa Martyny
> Priority: Major
> Fix For: 1.10.4
>
>
> When we use reschedule mode for our sensors, about an hour into our core
> pipeline running, the following happens:
> 1. Negative try_number: On the Scheduler we begin to see `Executor reports
> execution of [task info here] exited with status success for try_number -1`.
> The try_number then keeps decrementing until it reaches -4. On every run,
> -4 is the value at which the following steps play out:
> 2. We see a spike in (and then a stop of) this error message on the
> Scheduler: `ERROR - Executor reports task instance {} finished ({}) although
> the task says its {}. Was the task killed externally?` coming from
> `airflow/jobs.py#_process_executor_events`
> 3. Sometimes followed by a few instances of the error on a single Worker:
> `Celery command failed` coming from
> `airflow/executors/celery_executor.py#execute_command`
> 4. Followed on the Worker by one instance of the error: `ZeroDivisionError`
> 5. Followed by a spike in `ZeroDivisionError` on the Scheduler originating
> from `airflow/models/__init__.py#next_retry_datetime` line 1183
> 6. The pipeline then grinds to a halt. Tasks sit in a scheduled state in the
> scheduler and Celery does not pick them up. If try_numbers go negative but
> never reach -4, the pipeline does not grind to a halt.
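One possible explanation for the `ZeroDivisionError` in step 5 and for the -4 threshold, offered as a minimal sketch rather than a confirmed diagnosis: if the exponential-backoff delay is computed roughly as `int(retry_delay * 2 ** (try_number - 2))` and then used as a modulus for hash-based jitter, a sufficiently negative `try_number` truncates the delay to 0 and the modulo fails. The function `backoff_seconds`, the task key, and the 60-second `retry_delay` below are hypothetical and are not taken from the Airflow source.

```python
import hashlib
from datetime import timedelta


def backoff_seconds(try_number, retry_delay, key="dag_id.task_id"):
    """Hypothetical stand-in for exponential-backoff arithmetic (not Airflow code)."""
    # Assumed formula: the minimum backoff doubles with every try.
    # For negative try_number the factor is a small fraction, and int()
    # truncates the product toward zero.
    min_backoff = int(retry_delay.total_seconds() * (2 ** (try_number - 2)))
    # A stable hash of the task key provides deterministic jitter.
    digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
    # The modulo raises ZeroDivisionError when min_backoff is 0.
    return min_backoff + digest % min_backoff


delay = timedelta(seconds=60)  # hypothetical retry_delay
for n in (1, -1, -3, -4):
    try:
        print(n, backoff_seconds(n, delay))
    except ZeroDivisionError:
        print(n, "ZeroDivisionError")
```

Under these assumptions a 60-second delay still yields a non-zero backoff at -3 (60/32 truncates to 1) but not at -4 (60/64 truncates to 0), which would be consistent with the threshold reported above. A different `retry_delay` would move that threshold.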
>
> We identified that reschedule mode decrements the `try_number` in
> `airflow/models/__init__.py#_handle_reschedule`.
> We did not identify why it never re-increments the `try_number` afterwards,
> which is ostensibly what the code is attempting to do: reuse the same
> `try_number` and write to the same log file.
> When we switched the sensors to poke mode instead, all of the above problems
> stopped.
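The drift described in the report can be illustrated with a toy model. This is only a sketch under the assumption that every reschedule decrements the counter by one and that, on the faulty path, no matching increment happens when the task starts again. `ToyTaskInstance` and its methods are hypothetical and do not reproduce the Airflow classes.

```python
class ToyTaskInstance:
    """Hypothetical model of a task instance's try counter (not Airflow code)."""

    def __init__(self):
        self.try_number = 1

    def start(self, increments=True):
        # A normal start is assumed to advance the counter by one.
        if increments:
            self.try_number += 1

    def handle_reschedule(self):
        # Step back so that the next poke reuses the same try number
        # (and therefore the same log file).
        self.try_number -= 1


balanced = ToyTaskInstance()
drifting = ToyTaskInstance()
for _ in range(5):
    # Intended behaviour: each decrement is paired with an increment.
    balanced.start()
    balanced.handle_reschedule()
    # Assumed faulty path: the decrement happens without its increment.
    drifting.start(increments=False)
    drifting.handle_reschedule()

print(balanced.try_number, drifting.try_number)
```

In this model the balanced counter stays at its starting value, while the unbalanced one loses one per reschedule and goes negative after a handful of pokes, matching the pattern of steadily decreasing values in the Scheduler log.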