o-nikolas commented on code in PR #53435:
URL: https://github.com/apache/airflow/pull/53435#discussion_r2220158382


##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -2016,7 +2017,7 @@ def _maybe_requeue_stuck_ti(self, *, ti, session):
                     extra=f"Task was requeued more than 
{self._num_stuck_queued_retries} times and will be failed.",
                 )
             )
-            ti.set_state(TaskInstanceState.FAILED, session=session)
+            executor.fail(ti.key)

Review Comment:
   > BaseExecutor fail (last line):
   
   This is just changing the executors state of the task not the 
schedulers/task's state  (triggering an event to be sent to the scheduler). You 
can see that from @karenbraganz logs/test that the mechanism that is meant to 
detect drift, i.e. when the executor's view of task state does not match the 
scheduler's view of state (which is really the ORM TI state), is being 
triggered here. The scheduler in that case ultimately marks the task as failed 
after logging that message to the users.
   
   So while this works, I'm not sure this is the best approach, because we're 
going in a very wide circle to get the job done. We shouldn't be leaning on 
that drift detection mechanism, because in this scenario the scheduler does 
know the current state (it's the entity kicking all this in motion after all!), 
so I would prefer we update both things here. Set the TI state and also set the 
executor state to fail. If you have the time to test the outcome of that 
@karenbraganz that would be tremendous! I really appreciate your engagement on 
this PR and the GH Issue that started it, I really liked your investigation 
there, nice work! 😃 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to