jscheffl commented on code in PR #64076:
URL: https://github.com/apache/airflow/pull/64076#discussion_r2984466329


##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -2726,6 +2726,7 @@ def adopt_or_reset_orphaned_tasks(self, session: Session = NEW_SESSION) -> int:
                         .values(state=JobState.FAILED)
                     )
                     num_failed: int = getattr(result, "rowcount", 0)
+                    session.commit()  # Release any lock caused by flagging tasks

Review Comment:
   Thanks, you are right. Regarding (2), I was imprecise. We saw API servers
get killed by K8s on failing liveness checks, because the /health endpoint
queries the job table to report liveness and that SELECT was blocked by the
lock. So a long-lasting lock held while iterating over the TI list, together
with the potential lock on the job table, caused the API servers (ALL at the
same time!) to be killed by K8s. That made the API unavailable (HTTP 502),
which stopped worker heartbeats... and finally killed all tasks.
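
   To illustrate why the early commit matters, here is a minimal sketch, not
the Airflow code. It assumes SQLite from the standard library as a stand-in
for the real backend, a `job` table reduced to two columns, and a hypothetical
helper `second_writer_blocked`. SQLite locks the whole database file and
blocks a second writer rather than a SELECT, so this only shows that the lock
taken by the UPDATE lives until the commit, not the exact failure we saw.

```python
import os
import sqlite3
import tempfile


def second_writer_blocked(commit_early: bool) -> bool:
    """Return True if another connection is locked out during the slow phase.

    Hypothetical helper for illustration only; it is not part of Airflow.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None: we issue BEGIN/COMMIT ourselves.
        # timeout=0: fail immediately instead of waiting for the lock.
        scheduler = sqlite3.connect(path, timeout=0, isolation_level=None)
        other = sqlite3.connect(path, timeout=0, isolation_level=None)
        try:
            scheduler.execute(
                "CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT)"
            )
            scheduler.execute("INSERT INTO job (state) VALUES ('running')")

            # Flag the job as failed inside an open transaction.
            scheduler.execute("BEGIN")
            scheduler.execute("UPDATE job SET state = 'failed' WHERE id = 1")
            if commit_early:
                # Release the lock before the slow work starts.
                scheduler.execute("COMMIT")

            # Stand-in for the long iteration over the TI list: while it
            # runs, another connection tries to touch the job table.
            try:
                other.execute("UPDATE job SET state = state WHERE id = 1")
                blocked = False
            except sqlite3.OperationalError:
                # SQLite reports "database is locked".
                blocked = True

            if not commit_early:
                scheduler.execute("COMMIT")
            return blocked
        finally:
            scheduler.close()
            other.close()


if __name__ == "__main__":
    print("blocked without early commit:", second_writer_blocked(False))
    print("blocked with early commit:", second_writer_blocked(True))
```

   On a backend with row-level locking the details differ, but the pattern is
the same: whatever lock the flagging UPDATE takes is only released at commit,
so committing before the long loop keeps other clients of the job table
responsive.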



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
