pdellarciprete commented on issue #57618:
URL: https://github.com/apache/airflow/issues/57618#issuecomment-3631355864

   > > One question on the proposed approach though: do you see any limitations 
or edge cases with using an optimistic locking strategy instead — by tightening 
the WHERE condition at write time — rather than introducing a pessimistic 
database lock to prevent concurrent scheduling?
   > > I’m curious whether an optimistic approach could sufficiently mitigate 
the race without the overhead and contention risks of DB-level locking.
   > 
   > Are you saying that there is no possibility of having two schedulers 
picking the same TIs and both scheduling at the same time or a milliseconds 
apart?
   
   I am saying that, due to the high-availability (HA) design, it is entirely 
possible—and expected—that two schedulers will simultaneously pick up the same 
Task Instance (TI) as eligible during their respective read phases.
   The fix relies on forcing the subsequent write operation (the update to 
SCHEDULED) to become the exclusive claim check.
   
   Scheduler B's update, arriving second, affects 0 rows because the 
tightened WHERE condition no longer matches, and the Python code sees 
`rowcount = 0`.
   
   This result signals to Scheduler B: "This Task Instance has already been 
successfully claimed and advanced by a competing process." Scheduler B 
immediately and safely discards management of that specific TI for the current 
step, preventing the flawed CASE statement from running and corrupting the 
`try_number`.
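
   As a rough illustration of the claim check described above, here is a 
minimal sketch using an in-memory SQLite database. The table layout, the 
`'none'` and `'scheduled'` state values, and the `claim_task` helper are 
hypothetical and are not taken from the Airflow codebase; the sketch only 
shows how a tightened WHERE condition plus a `rowcount` check lets exactly 
one of two competing writers win.

```python
import sqlite3


def claim_task(conn, ti_id):
    """Hypothetical helper: try to move a task instance to 'scheduled'.

    The WHERE clause only matches while the row is still unclaimed, so the
    UPDATE doubles as the claim check. Returns True if this caller won.
    """
    cur = conn.execute(
        "UPDATE task_instance "
        "SET state = 'scheduled', try_number = try_number + 1 "
        "WHERE id = ? AND state = 'none'",
        (ti_id,),
    )
    conn.commit()
    # rowcount == 0 means another writer already advanced the row.
    return cur.rowcount == 1


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    "id INTEGER PRIMARY KEY, state TEXT, try_number INTEGER)"
)
conn.execute("INSERT INTO task_instance VALUES (1, 'none', 0)")
conn.commit()

# Both "schedulers" saw the row as eligible during their read phase;
# only the first write matches the WHERE condition.
scheduler_a_won = claim_task(conn, 1)
scheduler_b_won = claim_task(conn, 1)

try_number = conn.execute(
    "SELECT try_number FROM task_instance WHERE id = 1"
).fetchone()[0]
```

   In this sketch the second call changes nothing, so `try_number` is 
incremented once rather than twice; the loser simply skips the row.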
   
   The concurrency issue is sporadic because the race window is extremely 
narrow; only when the race actually occurs will the `rowcount` of the second 
update be `0`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
