potiuk commented on PR #32991:
URL: https://github.com/apache/airflow/pull/32991#issuecomment-1666663312
Hey everyone - I am not very proud of this PR, but I am cautiously
optimistic that it is **likely** working around the cursed deadlock that
started to show up in test_backfill_job of the DaskExecutor.
I **believe** it was a real issue, not a test problem. And I **think** the
workaround I came up with should also handle similar "production" issues when
they happen.
It is similar to other fixes for deadlocks caused by the backfill job
(you can see comparable workarounds in other places).
Generally speaking - I was able to reproduce it locally (it happens once every
30-50 runs or so), so it is definitely not a "side effect" of another test. I
ran this one individually with the --with-db-init flag and still (with
difficulty) was able to produce the deadlock. When I added some debugging, it
turned out that the deadlock in `commit()` in the outer loop was actually
caused by a deadlock exception raised in the inner loop - right after
`_per_task_process` - the exception was propagated up, and then the finally
clause around line 960 attempted to commit again:
```
finally:
session.commit()
executor.end()
```
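To illustrate that failure mode with a simplified stand-in (these classes are my toy models, not Airflow's or SQLAlchemy's actual session): once a statement deadlocks, the session needs a rollback, so the `commit()` in the `finally` clause raises the same error again - and because an exception raised in a `finally` block replaces the in-flight one, the traceback points at the commit rather than at the statement that actually deadlocked.

```python
class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError."""


class FakeSession:
    """Toy session: after a failed statement, commit() keeps failing
    until rollback() clears the transaction (mirrors SQLAlchemy's
    'pending rollback' behaviour)."""

    def __init__(self):
        self.needs_rollback = False

    def update_task_instance(self):
        # Pretend the database reported a deadlock on this UPDATE.
        self.needs_rollback = True
        raise OperationalError("deadlock detected")

    def commit(self):
        if self.needs_rollback:
            raise OperationalError("deadlock detected (re-raised at commit)")

    def rollback(self):
        self.needs_rollback = False


def outer_loop(session):
    """Mimics the backfill control flow: the inner error is replaced by
    the one raised in the bare `finally`, so commit() is where we see it."""
    try:
        try:
            session.update_task_instance()  # the real deadlock happens here
        finally:
            session.commit()  # ...but the reported traceback points here
    except OperationalError as err:
        return str(err)
```

Running `outer_loop(FakeSession())` surfaces only the commit-time error, which is why the original deadlock site was invisible in the logs.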
So we never saw the original place where the deadlock occurred - because we
did not log that OperationalError in the inner loop, and the deadlock was
reported again when the outer loop tried to run a commit on the
already-deadlocked session.
It looks like the deadlock is caused by the lack of "for update" on the
DagRun in the backfill job, while the backfill job does take "for update" on
the task instance. So if the scheduler happens to want to update the
task_instance while keeping the dag_run locked "for update", and the backfill
job tries to update the dag_run at the same time, the two block each other.
BTW, possibly this is an inherent problem with the backfill job. Maybe this
is really expected, and maybe the workaround is the best we can do?
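For context, the textbook fix for this class of deadlock is consistent lock ordering: every writer acquires its row locks in the same order (e.g. always the DagRun row before any of its task instances). A minimal illustration with plain Python locks - the lock names and the ordering rule are my illustration, not Airflow code:

```python
import threading

# Two resources analogous to the dag_run row and one task_instance row.
dag_run_lock = threading.Lock()
task_instance_lock = threading.Lock()
completed = []


def scheduler_like():
    # Scheduler-like writer: lock dag_run first, then task_instance.
    with dag_run_lock:
        with task_instance_lock:
            completed.append("scheduler")


def backfill_like():
    # Safe variant of the backfill-like writer: the SAME order.
    # Taking task_instance first here would be the opposite-order
    # pattern that can deadlock, which matches what the logs suggest.
    with dag_run_lock:
        with task_instance_lock:
            completed.append("backfill")


threads = [
    threading.Thread(target=fn)
    for fn in (scheduler_like, backfill_like)
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(completed)} writers finished without deadlock")
```

With opposite acquisition orders, some interleavings would block forever; with a single agreed order, all 100 writers always finish.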
I gathered server-side logging where I could see both sides of the deadlock:
```
Process 11417 waits for ShareLock on transaction 70151; blocked by
process 11337.
# !!!! This is what backfill job does !!!
Process 11337: UPDATE dag_run SET last_scheduling_decision=NULL,
updated_at='2023-08-05T23:07:39.040007+00:00'::timestamptz WHERE dag_run.id = 1
# !!!!! This is what I presume Scheduler does !!!!
Process 11417: UPDATE task_instance SET
start_date='2023-08-05T23:07:42.249709+00:00'::timestamptz,
end_date='2023-08-05T23:07:42.249709+00:00'::timestamptz, duration=0.0,
state='skipped', updated_at='2023-08-05T23:07:42.253235+00:00'::timestamptz
WHERE task_instance.dag_id = 'example_bash_operator' AND task_instance.task_id
= 'run_this_last' AND task_instance.run_id =
'backfill__2017-01-01T00:00:00+00:00' AND task_instance.map_index = -1
```
I have a few more examples of server logs where I was able to reproduce it
locally: https://gist.github.com/potiuk/6bd2d855a01ed3e8877a30cefcec01b3
I added a bit more logging, and I specifically call `rollback()` in the inner
loop - without re-raising the error. This **should not** be a problem because
the backfill job runs in a loop and will pick the task up again next time. So
the most harm that can happen is that the task that caused the deadlock gets
slightly delayed.
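Roughly, the idea looks like this (a sketch, not the actual PR diff; `per_task_process`, the task list, and the session object here are simplified stand-ins):

```python
class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError."""


def backfill_inner_loop(tasks, session, per_task_process):
    """Process each task; if one deadlocks, roll back and move on.

    The backfill outer loop runs repeatedly, so a task skipped here is
    simply picked up again on the next iteration (slightly delayed),
    and the outer `finally: session.commit()` sees a clean session.
    """
    skipped = []
    for task in tasks:
        try:
            per_task_process(task, session)
        except OperationalError:
            # Clear the failed transaction instead of re-raising.
            session.rollback()
            skipped.append(task)
    session.commit()
    return skipped
```

The key point is that the rollback happens at the site of the deadlock, so the error is logged where it occurs rather than resurfacing at the outer commit.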
It was far more often reproducible on CI (often 1 or 2 Postgres jobs failed
with it), so I will now close/reopen this PR a few times tomorrow to see
whether I can make the deadlock happen in CI with this fix.
I'd love to hear from those who are more expert than I am on this one.
Maybe collectively we can solve the puzzle better than just via a workaround.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]