> In order for this to become a reality, Backfills need to be handled by the
> Airflow Scheduler as a normal DAG execution

I think it's a good idea.
It should natively solve problems like
https://github.com/apache/airflow/issues/11302

On Fri, May 24, 2024 at 10:58 PM Vikram Koka <vik...@astronomer.io.invalid>
wrote:

> Fellow Airflowers,
>
> I am following up on some of the proposed changes in the Airflow 3 proposal
> <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>,
> where more information was requested by the community.
>
> One specific topic was "Running Backfills at scale". This is not yet a
> full-fledged AIP, but a starting point for the discussion leading towards an
> AIP with fully defined technical details.
>
> Backfills at scale
>
> Backfills in Airflow 2.x are treated as an exception and executed by an
> incarnation of the BackfillJob, rather than the regular Airflow Scheduler
> itself. This results in unexpected interactions with the other DAGs being
> run by the main Airflow Scheduler at the same time, including resource
> contention and possibly unexpected delays, because established scalability
> configuration settings such as Concurrency are not consistently applied.
> It also adds code-level complexity by having two somewhat-similar
> implementations of scheduling logic.
>
>
> However, with ML model training, backfills are a common operation and need
> to be treated as a regular Airflow DAG / Task execution operation and not
> treated as an exception. It is also not possible to run a backfill unless
> you have direct access to the Airflow database/SSH access to the Airflow
> server, which is not possible for many/most data engineers.
>
>
> In order for this to become a reality, Backfills need to be handled by the
> Airflow Scheduler as a normal DAG execution, building on the Dynamic Task
> Mapping execution pattern, rather than an exception. Additionally, Backfill
> tasks will now ONLY be executed by the Airflow Workers, for obvious reasons
> including scalability. A less obvious, but important reason is Security,
> since it is ideal to have data connections to Enterprise data only happen
> through Airflow Workers, rather than any Airflow system components.
>
>
> As part of making Backfill support cleaner in Airflow, Backfill DAG
> execution will also be supported in the Airflow REST API.
>
>
> This proposal is purposefully light on exact implementation details but
> will include at least:
>
>    - Making the Airflow Scheduler responsible for scheduling decisions on all
>      DagRuns (instead of the current behavior, where it purposefully ignores
>      backfill runs)
>    - A new API endpoint to submit a "backfill request".
>
>
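
To make the "backfill request" idea concrete, here is a minimal sketch of what
such a request might carry and how it could expand into the per-day runs the
Scheduler would then handle like any other DagRun. Everything in it is an
assumption for illustration: the BackfillRequest class, its field names, the
daily interval, and the inclusive end date are hypothetical and do not come
from the proposal, which deliberately leaves these details open.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class BackfillRequest:
    """Hypothetical payload; no field name here is defined by the proposal."""

    dag_id: str
    start_date: date
    end_date: date  # assumed inclusive
    max_active_runs: int = 1  # assumed knob for limiting concurrency

    def logical_dates(self) -> list[date]:
        """Expand the range into one logical date per day (daily DAG assumed)."""
        if self.end_date < self.start_date:
            raise ValueError("end_date must not precede start_date")
        days = (self.end_date - self.start_date).days
        return [self.start_date + timedelta(days=i) for i in range(days + 1)]


# Example: a three-day backfill for a hypothetical DAG id.
request = BackfillRequest("example_dag", date(2024, 5, 1), date(2024, 5, 3))
dates = request.logical_dates()
```

In this sketch the request only describes the work. Creating the DagRuns and
applying concurrency limits would stay with the Scheduler, which is the point
of the proposal.
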
> --
>
>
> Best regards,
> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau
>
