> In order for this to become a reality, Backfills need to be handled by the Airflow Scheduler as a normal DAG execution
I think it's a good idea. It should natively solve problems like
https://github.com/apache/airflow/issues/11302

For illustration, I sketched at the bottom of this mail what submitting a
backfill through the proposed API endpoint might look like from a client's
point of view.

On Fri, May 24, 2024 at 10:58 PM Vikram Koka <vik...@astronomer.io.invalid> wrote:

> Fellow Airflowers,
>
> I am following up on some of the proposed changes in the Airflow 3 proposal
> <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>,
> where more information was requested by the community.
>
> One specific topic was "Running Backfills at scale". This is not yet a
> full-fledged AIP, but a starting point for the discussion leading towards
> an AIP with fully defined technical details.
>
> Backfills at scale
>
> Backfills in Airflow 2.x are treated as an exception and executed by an
> incarnation of the BackfillJob, rather than by the regular Airflow
> Scheduler itself. This results in unexpected interactions with the other
> DAGs being run by the main Airflow Scheduler at the same time, including
> resource contention and possibly unexpected delays, because established
> scalability configuration settings such as concurrency are not
> consistently applied. It also adds code-level complexity by having two
> somewhat-similar implementations of scheduling logic.
>
> However, for ML model training, backfills are a common operation and need
> to be treated as a regular Airflow DAG / Task execution operation rather
> than as an exception. It is also not possible to run a backfill unless you
> have direct access to the Airflow database or SSH access to the Airflow
> server, which is not possible for many/most data engineers.
>
> In order for this to become a reality, Backfills need to be handled by the
> Airflow Scheduler as a normal DAG execution, building on the Dynamic Task
> Mapping execution pattern, rather than as an exception. Additionally,
> Backfill tasks will now ONLY be executed by the Airflow Workers, for
> obvious reasons including scalability. A less obvious, but important,
> reason is security, since it is ideal to have data connections to
> Enterprise data only happen through Airflow Workers, rather than through
> any other Airflow system components.
>
> As part of making Backfill support cleaner in Airflow, Backfill DAG
> execution will also be supported in the Airflow REST API.
>
> This proposal is purposefully light on exact implementation details but
> will include at least:
>
> - Making the Airflow Scheduler responsible for scheduling decisions on all
>   DagRuns (instead of the current behavior, where it purposefully ignores
>   backfill runs)
> - A new API endpoint to submit a "backfill request".
>
> --
>
> Best regards,
> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau
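
P.S. Here is the sketch mentioned above. It is purely illustrative: the
proposal is intentionally light on implementation details and does not yet
define an endpoint path, payload, or auth scheme, so everything below (the
"/api/v1/backfills" path, the field names, the credentials, and the DAG id)
is an assumption of mine, not the actual API.

import requests

AIRFLOW_BASE_URL = "http://localhost:8080"

# Submit a hypothetical "backfill request" for a date range of a DAG.
response = requests.post(
    f"{AIRFLOW_BASE_URL}/api/v1/backfills",     # assumed endpoint path
    auth=("admin", "admin"),                    # assumes basic auth is enabled
    json={
        "dag_id": "example_training_pipeline",  # hypothetical DAG name
        "start_date": "2024-01-01T00:00:00Z",   # start of the interval to backfill
        "end_date": "2024-01-31T00:00:00Z",     # end of the interval to backfill
        "max_active_runs": 4,                   # assumed knob to cap concurrent runs
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())

Today the equivalent operation requires shell access to the Airflow
environment to run something like
"airflow dags backfill -s 2024-01-01 -e 2024-01-31 example_training_pipeline",
which is exactly the access barrier the proposal calls out.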