mwisnicki commented on issue #68699:
URL: https://github.com/apache/airflow/issues/68699#issuecomment-4738551761

   I think it's not limited to concurrent backfills, since I also triggered it with just one.
   
   `<🤖>`
   ---
   
   > is this bug limited to concurrent backfill requests?
   
   No. Any concurrent write to the SQLite metadata DB while `_create_runs_non_partitioned` is running its large loop could trigger it. The specific scenarios:
   
   - Sequential backfill creation with a busy scheduler: even a single `POST /api/v2/backfills` for a large date range can fail if the scheduler is writing aggressively enough to hold the SQLite lock for >5s while the creation loop is running. We saw this ourselves: some of our single backfills got partially committed too.
   - Large date range + active Airflow: the more dag runs being created (e.g. 898 daily runs), the longer the transaction stays open and the higher the chance of lock contention with the scheduler's heartbeat writes.
   - Any other concurrent API write: creating DAG runs manually, updating task instances, etc. while a backfill is being created could trigger the same partial commit.
   
   So the root issue is more general: any large transaction in `_create_backfill` is not safe under concurrent SQLite writes. Concurrent backfill creation just makes it very easy to reproduce reliably.
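
The lock contention behind these scenarios can be reproduced outside Airflow. This is a minimal sketch using only Python's standard `sqlite3` module; the `dag_run` table and the `demo_lock_contention` function are hypothetical stand-ins, not Airflow code. One connection holds the write lock the way a busy scheduler might, and a second writer with a deliberately short busy timeout gives up. Python's `sqlite3.connect` defaults to a 5 second timeout, which is consistent with the >5s threshold mentioned above.

```python
import os
import sqlite3
import tempfile


def demo_lock_contention() -> str:
    """Return the error a second writer sees while another connection holds the write lock."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")

    # isolation_level=None disables implicit transactions, so BEGIN/ROLLBACK are explicit.
    holder = sqlite3.connect(path, isolation_level=None)
    holder.execute("CREATE TABLE dag_run (id INTEGER PRIMARY KEY)")
    holder.execute("BEGIN IMMEDIATE")  # take the write lock, like a busy writer
    holder.execute("INSERT INTO dag_run DEFAULT VALUES")

    # timeout is the busy timeout in seconds; kept tiny so the demo fails fast.
    writer = sqlite3.connect(path, timeout=0.1, isolation_level=None)
    try:
        writer.execute("BEGIN IMMEDIATE")  # needs the write lock, which is taken
        return "no error"
    except sqlite3.OperationalError as exc:
        return str(exc)
    finally:
        writer.close()
        holder.execute("ROLLBACK")
        holder.close()


if __name__ == "__main__":
    print(demo_lock_contention())
```

The second writer fails as soon as its busy timeout expires, regardless of how much work it had planned to do in its own transaction.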
   
   The fix would apply broadly: either set a longer SQLite busy timeout, or make `_create_backfill` atomic by catching `OperationalError` and rolling back cleanly before re-raising.
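
Both options can be sketched with the standard `sqlite3` module. This is an illustration under stated assumptions, not the actual Airflow change: `open_metadata_db`, `create_runs_atomically`, the `dag_run` schema, and the 30 second value are all hypothetical, and Airflow itself goes through SQLAlchemy rather than raw `sqlite3`.

```python
import sqlite3
from typing import Iterable


def open_metadata_db(path: str, busy_timeout_s: float = 30.0) -> sqlite3.Connection:
    """Open a connection with a longer busy timeout (hypothetical helper)."""
    # timeout controls how long SQLite waits on a lock before raising
    # OperationalError; isolation_level=None makes transactions explicit.
    return sqlite3.connect(path, timeout=busy_timeout_s, isolation_level=None)


def create_runs_atomically(conn: sqlite3.Connection, run_ids: Iterable[str]) -> None:
    """Insert every run or none: roll back and re-raise on OperationalError."""
    try:
        conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        for run_id in run_ids:
            conn.execute("INSERT INTO dag_run (run_id) VALUES (?)", (run_id,))
        conn.execute("COMMIT")
    except sqlite3.OperationalError:
        # Undo any rows already inserted so no partial backfill is left behind.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

Taking the write lock with `BEGIN IMMEDIATE` before the loop means lock contention surfaces at the start of the transaction rather than part-way through it, and the rollback covers any failure that still occurs mid-loop.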


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.