mwisnicki opened a new issue, #68721:
URL: https://github.com/apache/airflow/issues/68721

   ### Under which category would you file this issue?
   
   Airflow Core
   
   ### Apache Airflow version
   
   3.2.2
   
   ### What happened and how to reproduce it?
   
   Again more slop but hopefully useful enough.
   
   `<🤖>`
   ---
   
   When a backfill is created for a DAG with fast-completing tasks (sub-second per run),
   the scheduler marks the backfill as complete before all queued runs have been executed.
   
   The root cause is in `_mark_backfills_complete` (`scheduler_job_runner.py`, around line 1967),
   which runs every 30 seconds and marks a backfill complete when no dag runs are in
   `running` or `queued` state:
   
   ```python
   ~exists(
       select(DagRun.id).where(
           and_(DagRun.backfill_id == Backfill.id, DagRun.state.in_(unfinished_states))
       )
   )
   ```
   
   When tasks complete faster than the scheduler's next scheduling loop can queue new
   runs, there is a window where all current runs are `success` and the next batch has
   not yet been dispatched. The completion check fires in this window and incorrectly
   marks the backfill done, leaving the remaining queued runs stranded.
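
The window described above can be illustrated with a toy model. This is a minimal sketch, not Airflow's scheduler code: `simulate`, `batch_size` and `check_every` are hypothetical names, and undispatched runs are modelled simply as runs that do not exist yet.

```python
UNFINISHED = {"queued", "running"}


def simulate(total_runs: int, batch_size: int, check_every: int) -> tuple[int, bool]:
    """Return (runs finished, marked complete) for a toy scheduler loop."""
    states: list[str] = []
    for tick in range(1, total_runs + 1):
        # 1. The scheduler dispatches the next batch of runs.
        remaining = total_runs - len(states)
        states += ["queued"] * min(batch_size, remaining)
        # 2. Tasks are sub-second, so the whole batch finishes before the next loop.
        states = ["success" if s in UNFINISHED else s for s in states]
        # 3. The periodic completion check only asks "is anything unfinished?".
        if tick % check_every == 0 and not any(s in UNFINISHED for s in states):
            return states.count("success"), True
    return states.count("success"), False


finished, marked_complete = simulate(total_runs=1096, batch_size=10, check_every=3)
print(finished, marked_complete)
```

With these made-up parameters the toy backfill is marked complete after 30 of 1096 runs, because the check never compares against the number of runs the backfill was supposed to produce.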
   
   **To reproduce:**
   
   1. Create a DAG with a no-op task and a long date range:
   
   ```python
   from airflow.sdk import dag, task
   from datetime import datetime
   
   @dag(
       dag_id="test_backfill_bug",
       schedule="@daily",
       start_date=datetime(2020, 1, 1),
       end_date=datetime(2022, 12, 31),
       catchup=False,
   )
   def test_backfill_bug():
       @task
       def noop():
           pass
       noop()
   
   test_backfill_bug()
   ```
   
   2. Create a backfill:
   
   ```bash
   airflow backfill create \
     --dag-id test_backfill_bug \
     --from-date 2020-01-01 \
     --to-date 2022-12-31 \
     --max-active-runs 10
   ```
   
   3. Observe that the backfill completes having only processed a fraction of the 1096 runs:
   
   ```python
   import sqlite3, os
   conn = sqlite3.connect(os.path.expanduser('~/airflow/airflow.db'))
   cur = conn.cursor()
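   # Look up the backfill row; bind the dag_id as a parameter instead of inlining it.
   cur.execute(
       "SELECT id, completed_at FROM backfill WHERE dag_id = ?",
       ("test_backfill_bug",),
   )
   b_id, completed_at = cur.fetchone()
   # Count the dag runs attached to this backfill, grouped by state.
   cur.execute(
       "SELECT state, COUNT(*) FROM dag_run WHERE backfill_id = ? GROUP BY state",
       (b_id,),
   )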
   print('completed_at:', completed_at)
   for row in cur.fetchall(): print(row)
   conn.close()
   ```
   
   **Observed output:**
   
   ```
   completed_at: 2026-06-18 03:42:47.759611
   ('success', 441)
   ('queued', 455)    <- remaining runs never executed
   ```
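
For reference, the arithmetic behind the 1096 figure and the observed counts can be checked as follows. This is a small sketch that assumes one run per calendar day with both endpoints included; `daily_run_count` is a hypothetical helper, not an Airflow function.

```python
from datetime import date


def daily_run_count(start: date, end: date) -> int:
    """Number of daily logical dates in the inclusive range [start, end]."""
    return (end - start).days + 1


expected = daily_run_count(date(2020, 1, 1), date(2022, 12, 31))
observed = {"success": 441, "queued": 455}  # counts copied from the observed output
present = sum(observed.values())
print(expected, present, expected - present)
```

Under that assumption the range yields 1096 runs, while the observed output accounts for 896 rows, so 200 expected runs do not appear in `dag_run` for this backfill at all. The output alone does not say why those rows are missing.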
   
   
   ### What you think should happen instead?
   
   The backfill should only be marked complete when all dag runs associated with it have
   reached a terminal state (`success` or `failed`), regardless of whether there is a
   momentary window where none are `running` or `queued`.
   
   A possible fix: check that the count of terminal dag runs equals the total
   `BackfillDagRun` associations (excluding skipped entries) before marking complete.
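
The count-based check suggested above could look roughly like this. It is a sketch against a deliberately simplified, hypothetical schema in an in-memory SQLite database, not a patch for `_mark_backfills_complete`; in particular, treating a non-NULL `exception_reason` as "skipped" is an assumption made for illustration.

```python
import sqlite3

TERMINAL_STATES = ("success", "failed")


def backfill_is_complete(conn: sqlite3.Connection, backfill_id: int) -> bool:
    """True only when every non-skipped association has a terminal dag run."""
    expected = conn.execute(
        "SELECT COUNT(*) FROM backfill_dag_run "
        "WHERE backfill_id = ? AND exception_reason IS NULL",
        (backfill_id,),
    ).fetchone()[0]
    terminal = conn.execute(
        "SELECT COUNT(*) FROM dag_run WHERE backfill_id = ? AND state IN (?, ?)",
        (backfill_id, *TERMINAL_STATES),
    ).fetchone()[0]
    # Requiring expected > 0 avoids declaring an empty backfill complete.
    return expected > 0 and terminal == expected


# Simplified stand-in tables (assumption: not the real Airflow schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dag_run (id INTEGER PRIMARY KEY, backfill_id INTEGER, state TEXT);
    CREATE TABLE backfill_dag_run (
        backfill_id INTEGER, dag_run_id INTEGER, exception_reason TEXT
    );
    INSERT INTO dag_run VALUES (1, 7, 'success'), (2, 7, 'queued');
    INSERT INTO backfill_dag_run VALUES
        (7, 1, NULL), (7, 2, NULL), (7, NULL, 'already exists');
""")

before = backfill_is_complete(conn, 7)  # one run is still queued
conn.execute("UPDATE dag_run SET state = 'success' WHERE id = 2")
after = backfill_is_complete(conn, 7)   # both non-skipped runs are terminal
print(before, after)
conn.close()
```

Comparing against the expected total, instead of asking whether anything is currently unfinished, means a momentary gap between batches cannot satisfy the condition.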
   
   ### Operating System
   
   macOS
   
   ### Deployment
   
   Virtualenv installation
   
   ### Apache Airflow Provider(s)
   
   _No response_
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Official Helm Chart version
   
   Not Applicable
   
   ### Kubernetes Version
   
   _No response_
   
   ### Helm Chart configuration
   
   _No response_
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   Note: PR #62561 (merged in 3.2.2) fixed a related but distinct issue where a backfill
   was marked complete before *any* dag runs were created (zero-runs race). This issue
   occurs after dag runs are created and begin executing: the completion window opens
   between scheduling batches when tasks complete faster than new ones are dispatched.
   
   Related issue: the SQLite `database is locked` error (see
   [apache/airflow#68699](https://github.com/apache/airflow/issues/68699)) can cause fewer
   dag runs to be created than expected, which makes this bug easier to trigger since
   fewer runs complete faster.
   
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
