pdellarciprete commented on issue #57618: URL: https://github.com/apache/airflow/issues/57618#issuecomment-3631355864
> > One question on the proposed approach though: do you see any limitations or edge cases with using an optimistic locking strategy instead, by tightening the WHERE condition at write time, rather than introducing a pessimistic database lock to prevent concurrent scheduling?

> > I’m curious whether an optimistic approach could sufficiently mitigate the race without the overhead and contention risks of DB-level locking.

> > Are you saying that there is no possibility of having two schedulers picking the same TIs and both scheduling at the same time or a milliseconds apart?

No. I am saying that, because of the high-availability (HA) design, it is entirely possible, and expected, that two schedulers will simultaneously pick up the same Task Instance (TI) as eligible during their respective read phases.

The fix relies on making the subsequent write operation (the update to SCHEDULED) the exclusive claim check. Whichever scheduler's update arrives second (call it Scheduler B) affects 0 rows, and the Python code sees `rowcount = 0`. This result signals to Scheduler B: "This Task Instance has already been successfully claimed and advanced by a competing process." Scheduler B then immediately and safely drops that specific TI for the current step, so the flawed CASE statement never runs and cannot corrupt `try_number`.

The concurrency issue is sporadic because the race window is extremely narrow. Only when the race actually occurs will the `rowcount` of the second update be `0`.
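To make the claim check concrete, here is a minimal sketch of the idea using the standard-library `sqlite3` module. It is not the Airflow code or the actual patch: the `task_instance` table, its columns, the `'none'` state value, and the `claim_for_scheduling` helper are all hypothetical, and the `try_number + 1` bump only stands in for bookkeeping that must happen exactly once.

```python
import sqlite3


def claim_for_scheduling(conn: sqlite3.Connection, ti_id: int) -> bool:
    """Hypothetical helper: try to claim one TI; True only for the winner."""
    cur = conn.execute(
        # The tightened WHERE clause is the claim check: the row matches
        # only while it is still in the state the scheduler read earlier.
        "UPDATE task_instance "
        "SET state = 'scheduled', try_number = try_number + 1 "
        "WHERE id = ? AND state = 'none'",
        (ti_id,),
    )
    conn.commit()
    # rowcount == 0 means a competing scheduler already advanced this row.
    return cur.rowcount == 1


# Hypothetical, simplified schema (not Airflow's real table definition).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    "id INTEGER PRIMARY KEY, state TEXT NOT NULL, try_number INTEGER NOT NULL)"
)
conn.execute("INSERT INTO task_instance VALUES (1, 'none', 0)")
conn.commit()

# Read phase: both schedulers see the same TI as eligible.
query = "SELECT id FROM task_instance WHERE state = 'none'"
seen_by_a = [row[0] for row in conn.execute(query)]
seen_by_b = [row[0] for row in conn.execute(query)]

# Write phase: the updates land one after the other.
scheduler_a_won = claim_for_scheduling(conn, seen_by_a[0])
scheduler_b_won = claim_for_scheduling(conn, seen_by_b[0])

state, try_number = conn.execute(
    "SELECT state, try_number FROM task_instance WHERE id = 1"
).fetchone()
print(scheduler_a_won, scheduler_b_won, state, try_number)
```

In this sketch the second update matches no rows, so Scheduler B can skip the TI and the counter is bumped only once. A single connection is used here for simplicity; with two real connections the database still serializes the two updates on the same row, which is what the approach relies on.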
