o-nikolas commented on code in PR #53435:
URL: https://github.com/apache/airflow/pull/53435#discussion_r2220158382
##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -2016,7 +2017,7 @@ def _maybe_requeue_stuck_ti(self, *, ti, session):
extra=f"Task was requeued more than
{self._num_stuck_queued_retries} times and will be failed.",
)
)
- ti.set_state(TaskInstanceState.FAILED, session=session)
+ executor.fail(ti.key)
Review Comment:
> BaseExecutor fail (last line):
This is just changing the executors state of the task not the
schedulers/task's state (triggering an event to be sent to the scheduler). You
can see that from @karenbraganz logs/test that the mechanism that is meant to
detect drift, i.e. when the executor's view of task state does not match the
scheduler's view of state (which is really the ORM TI state), is being
triggered here. The scheduler in that case ultimately marks the task as failed
after logging that message to the users.
So while this works, I'm not sure this is the best approach, because we're
going in a very wide circle to get the job done. We shouldn't be leaning on
that drift detection mechanism, because in this scenario the scheduler does
know the current state (it's the entity kicking all this in motion after all!),
so I would prefer we update both things here. Set the TI state and also set the
executor state to fail. If you have the time to test the outcome of that
@karenbraganz that would be tremendous! I really appreciate your engagement on
this PR and the GH Issue that started it, I really liked your investigation
there, nice work! 😃
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]