potiuk commented on PR #25266: URL: https://github.com/apache/airflow/pull/25266#issuecomment-1195458243
@ashb since you are back from vacation :) - you might want to take a look. It looks like it is a proper fix and we could explain exactly the scenario that would have caused the nastly Postgres deadlock (so the REAL one this time). When I implemented my solution, I assumed that task_run state change should also be shielded by the DagRun row-lock. I think it makes perfect sense, because we do not want any state change of any task instance belonging to DagRun when we are scheduling and keeping lock on the DagRun. Alternative solution (proposed initially by @dstaple who made the bulk of analysis) was to avoid the DagRun lock altogether when we are changing state during task_run. It seems to cure the deadlocks, but It has the nasty side effect that potentially shcheduler and task state change from task run are updating the same task instance and reading and then potentilly changing it's state at the same time. This is a bit more scalable (less synchronisation on DagRun, but quite a bit risky IMHO - and possibly that's also the reaosn for what @RNHTTR reproduced in https://github.com/apache/airflow/issues/23361 - where he attempted to DELETE DagRun while the tasks were running (also resulting with deadlock). The solution here should prevent from Deletion happening in parallel to changing state as well. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
