wolfier commented on issue #56196: URL: https://github.com/apache/airflow/issues/56196#issuecomment-3359430986
> What would you expect to happen in this case?

If the task instance in the above case fails, the task state will move directly to `failed` and, when [finalized](https://github.com/apache/airflow/blob/3.1.0/task-sdk/src/airflow/sdk/execution_time/task_runner.py#L1452), the `on_failure_callback` will be executed. This is not the situation described in the issue, as this is the code path where the task execution command completed successfully. The original issue is about the scheduler handling the failure of an externally killed task instance.

> there are 3 retries but no retries will ever be invoked. Do you expect on_retry_callback to be called?

I do not expect `on_retry_callback` to be invoked.

---

There are two distinct code paths:

* The task execution command fails unexpectedly, which is generally treated as the task instance being killed externally. This leads to the **scheduler** sending the `on_failure_callback` to the dag processor to be invoked.
  * If both `on_failure_callback` and `on_retry_callback` are set, the `on_failure_callback` is invoked.
  * If `on_failure_callback` is set and `on_retry_callback` is not, the `on_failure_callback` is invoked.
  * If `on_failure_callback` is not set and `on_retry_callback` is set, a [warning](https://github.com/apache/airflow/blob/3.1.0/airflow-core/src/airflow/dag_processing/processor.py#L309-L322) is printed. (I believe)
* The task execution command completes successfully and the execution results in an exception. This leads to the **worker** [resolving the task instance state](https://github.com/apache/airflow/blob/3.1.0/task-sdk/src/airflow/sdk/execution_time/task_runner.py#L1450) and [finalizing the task instance](https://github.com/apache/airflow/blob/3.1.0/task-sdk/src/airflow/sdk/execution_time/task_runner.py#L1452), which includes executing callbacks.

Currently in Airflow 3.1, when the task execution command fails unexpectedly, the `on_failure_callback` is invoked, and if the task instance is eligible for retry, the task will be retried. I believe the behaviour should be that the `on_retry_callback` is executed if present and otherwise no callback is invoked, since the task did not enter the `failed` state.

> You describe a case where the task is killed. Killed = task goes to failure state which means that on_failure_callback must be executed so the issue here is not if on_failure_callback should be executed - the answer to that is always yes.

To clarify, it is only the **instance** of task execution that is "killed". I used the term "killed externally" because that is how Airflow describes the situation. The task instance can be scheduled and queued again if it is eligible for retry. Additionally, the task instance never reached the `failed` state; since it is eligible for retry, its state is set to `up_for_retry`.

> The issue here is if on_retry_callback should be called as well.

I think only the `on_retry_callback` should be executed if the task is eligible for retries. If the task is not eligible for retries, the current behaviour is sufficient, though the implied `on_failure_callback` execution can be confusing.
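As a sketch of the proposed behaviour (not current Airflow code; `TaskInstance.is_eligible_to_retry()` is the existing eligibility check, while the mapping to a callback type is my assumption):

```python
from airflow.utils.state import TaskInstanceState


def callback_type_for_external_kill(ti) -> TaskInstanceState:
    """Sketch: pick which callback the scheduler should request when a
    task execution command dies unexpectedly."""
    if ti.is_eligible_to_retry():
        # The task instance moves to up_for_retry and will be scheduled
        # again, so only on_retry_callback should fire.
        return TaskInstanceState.UP_FOR_RETRY
    # No retries remain: the task instance truly fails, so
    # on_failure_callback should fire as it does today.
    return TaskInstanceState.FAILED
```

The scheduler could then denote the chosen type explicitly when it builds the callback request, along these lines: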
```python
request = TaskCallbackRequest(
    filepath=ti.dag_model.relative_fileloc,
    bundle_name=ti.dag_version.bundle_name,
    bundle_version=ti.dag_version.bundle_version,
    ti=ti,
    msg=msg,
    context_from_server=TIRunContext(
        dag_run=DRDataModel.model_validate(ti.dag_run, from_attributes=True),
        max_tries=ti.max_tries,
        variables=[],
        connections=[],
        xcom_keys_to_clear=[],
    ),
    task_callback_type=TaskInstanceState.FAILED,  # Explicitly denote the task callback type
)
```
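With the `task_callback_type` carried on the request, the dag processor could dispatch to the matching callback instead of implying a failure. A rough sketch of that dispatch (my assumption of how it could look, where `task` is the resolved operator and `request` is the `TaskCallbackRequest` above):

```python
from airflow.utils.state import TaskInstanceState


def resolve_callbacks(task, request):
    """Sketch: dag-processor-side dispatch on the explicit callback type."""
    if request.task_callback_type == TaskInstanceState.UP_FOR_RETRY:
        # The task instance will be retried: fire only on_retry_callback.
        return task.on_retry_callback
    # Terminal failure (or unset type): keep today's on_failure_callback.
    return task.on_failure_callback
```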
