potiuk commented on PR #25266:
URL: https://github.com/apache/airflow/pull/25266#issuecomment-1195458243

   @ashb since you are back from vacation :)  - you might want to take a look. 
It looks like it is a proper fix and we could explain exactly the scenario that 
would have caused the nastly Postgres deadlock (so the REAL one this time). 
   
   When I implemented my solution, I assumed that task_run state change should 
also be shielded by the DagRun row-lock. I think it makes perfect sense, 
because we do not want any state change of any task instance belonging to 
DagRun when we are scheduling and keeping lock on the DagRun.
   
   Alternative solution (proposed initially by @dstaple who made the bulk of 
analysis) was to avoid the DagRun lock altogether when we are changing state 
during task_run. It seems to cure the deadlocks, but It has the nasty side 
effect that potentially shcheduler and task state change from task run are 
updating the same task instance and reading and then potentilly changing it's 
state at the same time. This is a bit more scalable (less synchronisation on 
DagRun, but quite a bit risky IMHO - and possibly that's also the reaosn for 
what @RNHTTR  reproduced in https://github.com/apache/airflow/issues/23361  - 
where he attempted to DELETE DagRun while the tasks were running (also 
resulting with deadlock). 
   
   The solution here should prevent from Deletion happening in parallel to 
changing state as well. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to