mwisnicki commented on issue #68699: URL: https://github.com/apache/airflow/issues/68699#issuecomment-4738551761
I think it's not limited to concurrent backfills since I also triggered it with just one `<🤖>`

---

> is this bug limited to concurrent backfill requests?

No: any concurrent write to the SQLite metadata DB while `_create_runs_non_partitioned` is running its large loop could trigger it. The specific scenarios:

- Sequential backfill creation with a busy scheduler: even a single `POST /api/v2/backfills` for a large date range can fail if the scheduler is writing aggressively enough to hold the SQLite lock for >5s while the creation loop is running. We saw this ourselves; some of our single backfills got partially committed too.
- Large date range + active Airflow: the more dag runs being created (e.g. 898 daily runs), the longer the transaction holds and the higher the chance of lock contention with the scheduler's heartbeat writes.
- Any other concurrent API write: creating DAG runs manually, updating task instances, etc. while a backfill is being created could trigger the same partial commit.

So the root issue is more general: any large transaction in `_create_backfill` is not safe under concurrent SQLite writes. Concurrent backfill creation just makes it very easy to reproduce reliably.

The fix would apply broadly: either set a longer SQLite busy timeout, or make `_create_backfill` atomic by catching `OperationalError` and rolling back cleanly before re-raising.
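
To illustrate the two proposed fixes, here is a minimal sketch using only Python's standard `sqlite3` module. It is not Airflow's code (Airflow goes through SQLAlchemy), and the `dag_run` table layout plus the names `make_connection`, `create_runs_atomically`, `count_runs`, and the `fail_at` test hook are all hypothetical. It assumes the ">5s" above corresponds to the 5-second default of `sqlite3.connect(timeout=...)`.

```python
import sqlite3


def make_connection(path=":memory:", busy_timeout_s=30.0):
    # `timeout` is how long sqlite3 waits on a locked database before
    # raising OperationalError; the stdlib default is 5.0 seconds.
    conn = sqlite3.connect(path, timeout=busy_timeout_s)
    # Hypothetical, simplified stand-in for the real dag_run table.
    conn.execute("CREATE TABLE IF NOT EXISTS dag_run (logical_date TEXT)")
    conn.commit()
    return conn


def create_runs_atomically(conn, run_dates, fail_at=None):
    """Insert one row per date; commit all of them or none.

    `fail_at` is a hypothetical test hook that simulates a
    "database is locked" error partway through the loop.
    """
    try:
        for i, run_date in enumerate(run_dates):
            if fail_at is not None and i == fail_at:
                raise sqlite3.OperationalError("database is locked")
            conn.execute(
                "INSERT INTO dag_run (logical_date) VALUES (?)", (run_date,)
            )
        conn.commit()
    except sqlite3.OperationalError:
        # Discard every row inserted so far, then surface the error.
        conn.rollback()
        raise


def count_runs(conn):
    return conn.execute("SELECT COUNT(*) FROM dag_run").fetchone()[0]
```

With this shape, a failure in the middle of the loop leaves the table exactly as it was before the call, so the caller can retry the whole backfill instead of ending up with a partial set of runs.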
