GitHub user peter edited a comment on the discussion: Add the ability to backfill a DAG based on past Asset Events
I disagree with this statement from the Airflow backfill documentation: **“Backfill does not make sense for Dags that don’t have a time-based schedule.”**

I feel like the Airflow creators conflate two different and orthogonal things:

* Whether a DAG works on time/date-based (partitioned) data (which our DAG does in this case)
* Whether a DAG is triggered on a fixed schedule or by an event

The claim that a backfill doesn’t make sense just because a DAG is normally triggered by an event doesn’t hold.

My use case is that I have a DAG that populates a BigQuery table and is triggered via an asset from an upstream DAG. The data in BigQuery is partitioned by date, and the reason I need to backfill is that the schema of the BigQuery table has changed. The upstream data has not changed.

* I do not want to re-run the upstream DAG because it doesn't need backfilling (its data hasn't changed)
* I do not want to create asset events, as those events would indicate that something changed in the data generated by the upstream DAG, which is not the case and so would be misleading IMHO
* I do not want to clear out existing DAG runs, as this would delete history that is potentially valuable

I have a Python script that invokes the Airflow REST API to create DAG runs for the date range I am interested in. The script creates one DAG run for each `logical_date` in the date range. However, this is difficult since there is a unique constraint in the database on DAG ID and `logical_date`. So in order for this script to work, I would need to get a complete list of all historic DAG runs and then delete all DAG runs that overlap with the DAG runs that I want to create.
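For illustration, here is a minimal sketch of what such a script might look like. It is not the actual script. The helper names, the `manual_backfill` run ID prefix, the bearer-token authentication and the `/api/v2/dags/{dag_id}/dagRuns` path (Airflow 3 style; Airflow 2 used `/api/v1`) are all assumptions made for the example.

```python
import json
import urllib.request
from datetime import date, datetime, time, timedelta, timezone


def build_dag_run_payloads(start: date, end: date, run_id_prefix: str = "manual_backfill"):
    """Build one DAG run request body per day in [start, end], inclusive.

    The field names follow the Airflow REST API DAG run body as I understand
    it; the run ID format is a hypothetical convention for this example.
    """
    payloads = []
    day = start
    while day <= end:
        # Use midnight UTC of each day as the logical_date of the run.
        logical_date = datetime.combine(day, time.min, tzinfo=timezone.utc)
        payloads.append(
            {
                "dag_run_id": f"{run_id_prefix}__{day.isoformat()}",
                "logical_date": logical_date.isoformat(),
            }
        )
        day += timedelta(days=1)
    return payloads


def create_dag_run(base_url: str, dag_id: str, token: str, payload: dict) -> dict:
    """POST a single DAG run. URL path and auth scheme are assumptions."""
    request = urllib.request.Request(
        url=f"{base_url}/api/v2/dags/{dag_id}/dagRuns",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    # Raises urllib.error.HTTPError on a non-2xx response, which is what I
    # would expect for a logical_date that already has a run for this DAG.
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical usage (placeholder host, DAG ID and token):
# for body in build_dag_run_payloads(date(2025, 1, 1), date(2025, 1, 31)):
#     create_dag_run("http://localhost:8080", "my_bigquery_dag", "<token>", body)
```

The sketch only creates runs. It does not work around the unique constraint on DAG ID and `logical_date`, so any date that already has a run is expected to be rejected.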
Two problems with this approach:

* It makes something that should be easy and straightforward quite complex
* I do not want to delete historic DAG run information (as I mentioned above)

I think the fundamental question underlying the discussion is this: **Is backfilling a one-off thing that Airflow doesn't need to support and that you can handle yourself with a custom script, or is it something we should expect to happen every now and then and that Airflow should support?**

GitHub link: https://github.com/apache/airflow/discussions/59886#discussioncomment-15494701
