Fellow Airflowers,

I am following up on some of the proposed changes in the Airflow 3 proposal
<https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>,
where more information was requested by the community.

One specific topic was "Running Backfills at scale". This is not yet a
full-fledged AIP, but a starting point for the discussion leading towards
an AIP with fully defined technical details.

Backfills at scale

Backfills in Airflow 2.x are treated as an exception and executed by an
instance of the BackfillJob, rather than by the regular Airflow Scheduler
itself. This results in unexpected interactions with the other DAGs being
run by the main Airflow Scheduler at the same time, including resource
contention and possibly unexpected delays, because established scalability
configuration settings such as Concurrency are not consistently applied.
It also adds code-level complexity, since there are two somewhat-similar
implementations of scheduling logic.


However, with ML model training, backfills are a common operation and need
to be treated as a regular Airflow DAG / Task execution operation, not as
an exception. It is also not possible to run a backfill without direct
access to the Airflow database or SSH access to the Airflow server, which
many or most data engineers do not have.


To make backfills a regular operation, they need to be handled by the
Airflow Scheduler as a normal DAG execution, building on the Dynamic Task
Mapping execution pattern, rather than as an exception. Additionally,
Backfill tasks will now ONLY be executed by the Airflow Workers, for
obvious reasons including scalability. A less obvious but important reason
is security: it is ideal for connections to Enterprise data to happen only
through Airflow Workers, rather than through any other Airflow system
components.
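
As a rough illustration of what "a normal DAG execution" could mean for a
backfill, the sketch below expands a date range into one run per day and
groups the runs under a concurrency limit. It is only a sketch under
stated assumptions: the names expand_backfill, batches and max_active_runs
are hypothetical, a daily schedule is assumed, and fixed-size batches are a
simplification of how a scheduler would actually apply a limit.

```python
from datetime import date, timedelta
from typing import Iterator, List


def expand_backfill(start: date, end: date) -> List[date]:
    """Return one logical date per day in [start, end], inclusive.

    Hypothetical helper: assumes a daily schedule for simplicity.
    """
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]


def batches(run_dates: List[date], max_active_runs: int) -> Iterator[List[date]]:
    """Yield groups of runs no larger than the (assumed) concurrency limit."""
    if max_active_runs < 1:
        raise ValueError("max_active_runs must be at least 1")
    for i in range(0, len(run_dates), max_active_runs):
        yield run_dates[i:i + max_active_runs]


if __name__ == "__main__":
    runs = expand_backfill(date(2024, 1, 1), date(2024, 1, 5))
    for group in batches(runs, max_active_runs=2):
        print([d.isoformat() for d in group])
```

The point of the sketch is that the same limit applies to backfill runs as
to any other run, instead of being bypassed by a separate job.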


As part of making Backfill support cleaner in Airflow, Backfill DAG
execution will also be supported in the Airflow REST API.


This proposal is purposefully light on exact implementation details but
will include at least:



   - Making the Airflow Scheduler responsible for scheduling decisions on
     all DagRuns (instead of the current behavior, where it purposefully
     ignores backfill runs)
   - A new API endpoint to submit a "backfill request".
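
To make the "backfill request" idea a little more concrete, here is a
minimal sketch of what a request body might carry. Everything in it is an
assumption for discussion purposes: the BackfillRequest class and the field
names dag_id, start_date, end_date and max_active_runs are hypothetical and
are not part of any existing Airflow API.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class BackfillRequest:
    """Hypothetical payload for a 'backfill request' (not an Airflow API)."""

    dag_id: str
    start_date: date
    end_date: date
    max_active_runs: int = 1  # assumed per-backfill concurrency cap

    def to_json(self) -> str:
        """Validate the date range and serialize it with ISO 8601 dates."""
        if self.end_date < self.start_date:
            raise ValueError("end_date must not be before start_date")
        body = asdict(self)
        body["start_date"] = self.start_date.isoformat()
        body["end_date"] = self.end_date.isoformat()
        return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    request = BackfillRequest(
        dag_id="train_model",  # hypothetical DAG name
        start_date=date(2024, 1, 1),
        end_date=date(2024, 1, 31),
        max_active_runs=4,
    )
    print(request.to_json())
```

A request like this could be submitted by a data engineer with API access
only, without database or SSH access to the Airflow server.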


--


Best regards,
Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau
