potiuk commented on PR #32991:
URL: https://github.com/apache/airflow/pull/32991#issuecomment-1666663312
Hey everyone - I am not very proud of this PR, but I am cautiously
optimistic that it is **likely** working around the cursed deadlock that
started to show up in test_backfill_job of the DaskExecutor.
I **believe** it was a real issue, not a test problem. And I **think** the
workaround I came up with should also handle similar "production" issues when
they happen.
It is similar to other fixes for deadlocks caused by the backfill job
(you can see comparable workarounds in other places).
Generally speaking - I was able to reproduce it locally (it happens once every
30-50 runs or so), so it is definitely not a "side effect" of another test. I
ran this one individually with the --with-db-init flag and still (with
difficulty) was able to produce the deadlock. When I added some debugging, it
turned out that the deadlock in `commit()` in the outer loop was actually
caused by a deadlock exception raised in the inner loop - right after
`_per_task_process` - the exception was propagated up, and then the finally
clause around line 960 attempted to commit again:
```
finally:
session.commit()
executor.end()
```
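To illustrate that failure mode with a simplified stand-in (these classes are my toy models, not Airflow's or SQLAlchemy's actual session): once a statement deadlocks, the session needs a rollback, so the `commit()` in the `finally` clause raises the same error again - and because an exception raised in a `finally` block replaces the in-flight one, the traceback points at the commit rather than at the statement that actually deadlocked.

```python
class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError."""


class FakeSession:
    """Toy session: after a failed statement, commit() keeps failing
    until rollback() clears the transaction (mirrors SQLAlchemy's
    'pending rollback' behaviour)."""

    def __init__(self):
        self.needs_rollback = False

    def update_task_instance(self):
        # Pretend the database reported a deadlock on this UPDATE.
        self.needs_rollback = True
        raise OperationalError("deadlock detected")

    def commit(self):
        if self.needs_rollback:
            raise OperationalError("deadlock detected (re-raised at commit)")

    def rollback(self):
        self.needs_rollback = False


def outer_loop(session):
    """Mimics the backfill control flow: the inner error is replaced by
    the one raised in the bare `finally`, so commit() is where we see it."""
    try:
        try:
            session.update_task_instance()  # the real deadlock happens here
        finally:
            session.commit()  # ...but the reported traceback points here
    except OperationalError as err:
        return str(err)
```

Running `outer_loop(FakeSession())` surfaces only the commit-time error, which is why the original deadlock site was invisible in the logs.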
So we never saw the original place where the deadlock occurred - because we
did not log that OperationalError in the inner loop, and the deadlock was
reported again when the outer loop tried to run a commit on the
already-deadlocked session.
It looks like the deadlock is caused by the lack of "for update" on the
DagRun in the backfill job, while the backfill job does take "for update" on
the task instance. So if the scheduler happens to want to update the
task_instance while keeping the dag_run locked "for update", and the backfill
job tries to update the dag_run at the same time, the two block each other.
BTW, possibly this is an inherent problem with the backfill job. Maybe this
is really expected, and maybe the workaround is the best we can do?
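For context, the textbook fix for this class of deadlock is consistent lock ordering: every writer acquires its row locks in the same order (e.g. always the DagRun row before any of its task instances). A minimal illustration with plain Python locks - the lock names and the ordering rule are my illustration, not Airflow code:

```python
import threading

# Two resources analogous to the dag_run row and one task_instance row.
dag_run_lock = threading.Lock()
task_instance_lock = threading.Lock()
completed = []


def scheduler_like():
    # Scheduler-like writer: lock dag_run first, then task_instance.
    with dag_run_lock:
        with task_instance_lock:
            completed.append("scheduler")


def backfill_like():
    # Safe variant of the backfill-like writer: the SAME order.
    # Taking task_instance first here would be the opposite-order
    # pattern that can deadlock, which matches what the logs suggest.
    with dag_run_lock:
        with task_instance_lock:
            completed.append("backfill")


threads = [
    threading.Thread(target=fn)
    for fn in (scheduler_like, backfill_like)
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(completed)} writers finished without deadlock")
```

With opposite acquisition orders, some interleavings would block forever; with a single agreed order, all 100 writers always finish.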
I gathered server-side logging where I could see both sides of the deadlock:
```
Process 11417 waits for ShareLock on transaction 70151; blocked by
process 11337.
# !!!! This is what backfill job does !!!
Process 11337: UPDATE dag_run SET last_scheduling_decision=NULL,
updated_at='2023-08-05T23:07:39.040007+00:00'::timestamptz WHERE dag_run.id = 1
# !!!!! This is what I presume Scheduler does !!!!
Process 11417: UPDATE task_instance SET
start_date='2023-08-05T23:07:42.249709+00:00'::timestamptz,
end_date='2023-08-05T23:07:42.249709+00:00'::timestamptz, duration=0.0,
state='skipped', updated_at='2023-08-05T23:07:42.253235+00:00'::timestamptz
WHERE task_instance.dag_id = 'example_bash_operator' AND task_instance.task_id
= 'run_this_last' AND task_instance.run_id =
'backfill__2017-01-01T00:00:00+00:00' AND task_instance.map_index = -1
```
I have a few more examples of server logs where I was able to reproduce it
locally: https://gist.github.com/potiuk/6bd2d855a01ed3e8877a30cefcec01b3
I added a bit more logging, and I specifically call `rollback()` in the inner
loop - without re-raising the error. This **should not** be a problem because
the backfill job runs in a loop and will pick the task up again next time. So
the most harm that can happen is that the task that caused the deadlock gets
slightly delayed.
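Roughly, the idea looks like this (a sketch, not the actual PR diff; `per_task_process`, the task list, and the session object here are simplified stand-ins):

```python
class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError."""


def backfill_inner_loop(tasks, session, per_task_process):
    """Process each task; if one deadlocks, roll back and move on.

    The backfill outer loop runs repeatedly, so a task skipped here is
    simply picked up again on the next iteration (slightly delayed),
    and the outer `finally: session.commit()` sees a clean session.
    """
    skipped = []
    for task in tasks:
        try:
            per_task_process(task, session)
        except OperationalError:
            # Clear the failed transaction instead of re-raising.
            session.rollback()
            skipped.append(task)
    session.commit()
    return skipped
```

The key point is that the rollback happens at the site of the deadlock, so the error is logged where it occurs rather than resurfacing at the outer commit.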
It was far more often reproducible on CI (often 1 or 2 Postgres jobs failed
with it), so I will now close/reopen this PR a few times tomorrow to see
whether I can make the deadlock happen in CI with this fix.
I'd love to hear from those who are more expert than I am on this one.
Maybe collectively we can solve the puzzle better than just via a workaround.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]