kimyen commented on issue #18317:
URL: https://github.com/apache/airflow/issues/18317#issuecomment-924377251


   > What I'm after is a way to insert multiple dag-runs for historical dates 
in bulk from the UI, possibly with some tasks already marked as 
complete/skipped, as well as clearing tasks/dagruns between certain dates.
   
   @thejens without knowing the explicit use case, seeing all the things you 
listed here prompts me to share this doc I wrote. I think there are already 
multiple ways to "backfill" using the scheduler. See Manual backfill > Airflow 
built-in options.
   
   *Note that the "backfill request view" in the doc below refers to the first 
backfill tool I mentioned above.*
   
   # How to backfill your DAG
   
   ## Background
   
   There are multiple ways to backfill a DAG in Airflow. We will attempt to 
describe when to use each option.
   
   ### Use case 1: New DAG 
   
   A new DAG is created on April 4th, 2020 and we want the DAG to collect data 
starting from March 1st, 2020. 
   To achieve this, while writing the DAG definition, we can set `catchup=True` 
on the DAG and `"start_date": datetime(2020, 3, 1)` in the DAG's `default_args`.
   
   When the code is deployed to production, the backfill (from March 1st to 
current time) is automatically started by the scheduler.
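
As a rough illustration of what the scheduler would have to catch up on, here is a minimal sketch using only the standard library. It assumes a daily schedule and that a run is only created once its interval has fully elapsed; the `dag_id` and the helper `expected_catchup_dates` are hypothetical names, and `catchup` is shown as a DAG-level argument.

```python
from datetime import datetime, timedelta

# Settings from the use case. With Airflow installed, these would be passed
# to the DAG constructor, e.g. DAG(**dag_kwargs). The dag_id is hypothetical.
dag_kwargs = {
    "dag_id": "example_new_dag",
    "default_args": {"start_date": datetime(2020, 3, 1)},
    "schedule_interval": timedelta(days=1),  # assumption: daily schedule
    "catchup": True,
}


def expected_catchup_dates(start, now, interval):
    """Return the interval start dates whose intervals have fully elapsed by `now`."""
    dates = []
    current = start
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


deploy_time = datetime(2020, 4, 4)  # deployed at midnight, for simplicity
runs = expected_catchup_dates(
    dag_kwargs["default_args"]["start_date"],
    deploy_time,
    dag_kwargs["schedule_interval"],
)
print(len(runs), runs[0].date(), runs[-1].date())
```

Under these assumptions the catch-up covers 34 daily runs, from 2020-03-01 through 2020-04-03.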
   
   ### Use case 2: Extend DAG runs further in the past
   
   An existing DAG has DAG runs starting from March 1st, 2020. We want to 
extend it to January 1st, 2020. We can achieve this by:
   - ensuring that the DAG has `catchup=True`, and
   - changing the start date to January 1st, 2020.
   
   When the code is deployed to production, the backfill (from January 1st to 
March 1st) is automatically started by the scheduler.
   **If there are any successful DAG runs after the start date, Airflow will 
not trigger the `catchup`. See [Start from Dag Runs view](#start-from-dag-runs-view) 
to delete the successful DAG runs and trigger the catchup process.**
   
   This can also be achieved by manually backfilling the DAG from January 1st 
to March 1st. See [manual backfill section](#manual-backfill).
   
   ### Use case 3: DAG logic change
   
   An existing DAG was used to calculate some metrics. However, the 
calculations need to be updated, and all past successful DAG runs need to be 
rerun to update the resulting data for those days.
   For example, the change was deployed to production on May 1st, 2020. The May 
1st DAG run is then scheduled to run and uses the new DAG logic. However, all 
prior DAG runs from January 1st, 2020 to April 30th, 2020 need to be re-run. 
   
   In this case, a manual backfill needs to be triggered. See [manual backfill 
section](#manual-backfill).
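
For reference, a date-range re-run like this one can also be expressed as a CLI backfill. The sketch below only builds the command line; it assumes the Airflow 2.x syntax (`airflow dags backfill`; Airflow 1.10 uses `airflow backfill`), and the DAG id `my_metrics_dag` and the helper `backfill_command` are hypothetical. Check `airflow dags backfill --help` on your version before relying on any flag.

```python
import shlex
from datetime import date


def backfill_command(dag_id, start, end, reset_dagruns=False):
    """Build an argv list for `airflow dags backfill` (Airflow 2.x syntax assumed)."""
    argv = [
        "airflow", "dags", "backfill",
        "--start-date", start.isoformat(),
        "--end-date", end.isoformat(),
    ]
    if reset_dagruns:
        # Ask the backfill to reset DAG runs that already exist in the range,
        # so that previously successful dates are run again.
        argv.append("--reset-dagruns")
    argv.append(dag_id)
    return argv


cmd = backfill_command(
    "my_metrics_dag", date(2020, 1, 1), date(2020, 4, 30), reset_dagruns=True
)
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps it safe to pass to `subprocess.run` without shell quoting issues.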
   
   ### Use case 4: New Task
   
   A new task is added to an existing DAG. Regardless of whether your DAG has 
`catchup=True`, the existing DAG runs have already completed, so the scheduler 
will not automatically trigger backfill runs for the new task.
   
   In this case, a manual backfill needs to be triggered. See [manual backfill 
section](#manual-backfill).
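
When a backfill is limited to a single task through a task pattern (for example the CLI's `--task-regex` option), it helps to anchor the pattern so that similarly named tasks are not picked up. The sketch below assumes the pattern is applied as a substring search against each task id; the task ids and the helper `exact_task_pattern` are hypothetical.

```python
import re


def exact_task_pattern(task_id):
    """Return a regex that matches only the given task id, even under a substring search."""
    return "^" + re.escape(task_id) + "$"


# Hypothetical task ids: "load_events" is the newly added task.
task_ids = ["load_events", "load_events_v2", "build_report"]
pattern = exact_task_pattern("load_events")

# An unanchored pattern also matches the similarly named task.
loose = [t for t in task_ids if re.search("load_events", t)]
# The anchored pattern selects the new task only.
strict = [t for t in task_ids if re.search(pattern, t)]

print(loose)
print(strict)
```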
   
   
   ## Manual backfill
   
   ### Understand backfill and scheduler
   
   [This page](https://airflow.apache.org/docs/stable/scheduler.html) describes 
how the Airflow scheduler, catchup, backfill, and external triggers work. We 
highly recommend that you understand these concepts before performing a manual 
backfill.
   
   ### Things to check before starting a manual backfill
   
   Answering these questions will also help you choose the most suitable option 
for performing the backfill.
   
   - What: Do you want to backfill the entire DAG or specific task(s)?
   - When: What time range do you want to backfill?
   - Pre-condition: Are there dependencies for your DAG/task? If so, have the 
dependencies been met or the pre-conditions been satisfied?
   - How:
     - Are there existing DAG runs for the period you want to backfill? If so, 
do you want to re-run or skip them?
     - If you want to backfill a specific task, can the upstream task(s) be 
ignored?
   - Impact:
     - What data does this backfill change?
     - Would it result in duplicated data?
   
   ### Airflow built-in options
   
   Airflow has multiple built-in options to trigger a backfill manually. 
   
   #### Start from Tree view
   
   The Tree view option is the best fit if you only want to backfill a handful 
of specific tasks. 
   
   **Only use this option if the task's dependencies have been met.**
   
   Go to your DAG's tree view:
   
   - Click on the task you want to backfill
   - On the line with the "Run" button, click "Ignore Task State"
   - Click "Run"
   
   
   #### Start from Task Instance view
   
   The Task Instance view option is the best fit if you want to backfill more 
than a handful of specific tasks, that is, when clicking on each task and 
running it manually would take too much time. 
   
   **Only use this option if:**
   - your DAG is configured with `catchup=True`,
   - the tasks' dependencies have been met,
   - the backfill period is covered by the DAG's `start_date` - `end_date` 
range,
   - there are existing task runs for the task(s) you want to backfill.
   
   This option is accomplished by clearing out the task instances for the task 
run(s) you want to backfill. The scheduler will schedule new task runs to fill 
in the ones that have been cleared out.
   
   Let's imagine we want to backfill `fill_meu_v2_retention` from 
`key_metrics_cube` DAG between `2019-10-01` and `2019-10-10`. Here are the 
steps to carry out this option:
   
   * Go to `Browse` -> `Task Instances` and find the tasks you want to 
backfill. 
   
   * In this example, the target backfill task is `fill_meu_v2_retention`. 
However, we also need to clear the `drop_meu_v2_retention` task to make sure 
data is not duplicated. Select all the task runs for `fill_meu_v2_retention` 
and `drop_meu_v2_retention` during the backfill period; and click on `With 
selected` -> `clear`
   
   ![Screen Shot 2019-10-18 at 11 43 31 
AM](https://user-images.githubusercontent.com/11540582/67084130-a52ea300-f19c-11e9-993d-13e03f89a65c.png)
   
   * Airflow will not `catchup` if there are already completed `DAG Runs`, so 
we need to clear those to trigger the `catchup` process. See [Start 
from Dag Runs view](#start-from-dag-runs-view).
   
   You should start seeing tasks running shortly.
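
The selection made in the steps above can be sketched as a simple filter. This is only an illustration on stand-in data, not how the UI is implemented: task instances are modelled as `(task_id, execution_date)` tuples, and the third task id, `send_report`, is hypothetical.

```python
from datetime import date, timedelta

BACKFILL_START = date(2019, 10, 1)
BACKFILL_END = date(2019, 10, 10)
# The fill task plus the drop task that prevents duplicated data.
TASKS_TO_CLEAR = {"fill_meu_v2_retention", "drop_meu_v2_retention"}


def select_for_clearing(task_instances):
    """Keep the instances of the target tasks that fall inside the backfill period."""
    return [
        (task_id, day)
        for task_id, day in task_instances
        if task_id in TASKS_TO_CLEAR and BACKFILL_START <= day <= BACKFILL_END
    ]


# Stand-in data: daily task instances from 2019-09-25 to 2019-10-15.
all_days = [date(2019, 9, 25) + timedelta(days=n) for n in range(21)]
all_tasks = ["drop_meu_v2_retention", "fill_meu_v2_retention", "send_report"]
task_instances = [(task_id, day) for day in all_days for task_id in all_tasks]

to_clear = select_for_clearing(task_instances)
print(len(to_clear))  # 2 tasks x 10 days
```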
   
   #### Start from Dag Runs view
   
   The DAG Runs view is the best fit to re-run DAG(s)/task(s) that already have 
DAG runs.
   
   **Only use this option if:**
   - your DAG is configured with `catchup=True`,
   - the tasks' dependencies have been met,
   - the backfill period is covered by the DAG's `start_date` - `end_date` 
range,
   - the task runs for the task you want to backfill do not exist, or have been 
deleted using the instructions [here](#start-from-task-instance-view). If not, 
please see [Start from Task Instance view](#start-from-task-instance-view) first.
   
   In order to do this, you need to:
   - Pause the DAG.
   - Go to `Browse` -> `DAG Runs`.
   - Click `Add Filter` and filter by `Dag Id`.
   - Once you've found all the DAG Runs within your backfill period, select all 
of them and delete them.
   - Unpause your DAG to trigger the `catchup` process. 
   
   Once you've deleted the DAG runs, they will disappear from the Tree view. 
However, any task instances for those DAG runs (which existed before you 
deleted the DAG runs) will still be there. The scheduler will automatically 
schedule new DAG runs and only run the tasks that have not been completed.
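
To make the consequence concrete, here is a small model of the behaviour described above. It is a sketch of the description, not of the scheduler's actual code: the task ids, the dates, and the helper `tasks_to_run` are hypothetical, and completed task instances are modelled as `(task_id, run_date)` pairs.

```python
def tasks_to_run(dag_tasks, deleted_run_dates, completed):
    """For each re-created DAG run, list the tasks with no completed task instance left."""
    plan = {}
    for run_date in sorted(deleted_run_dates):
        plan[run_date] = [t for t in dag_tasks if (t, run_date) not in completed]
    return plan


dag_tasks = ["extract", "transform", "load"]
# Task instances that survived the deletion of their DAG runs.
completed = {
    ("extract", "2020-01-01"),
    ("transform", "2020-01-01"),
    ("load", "2020-01-01"),
    ("extract", "2020-01-02"),
}

plan = tasks_to_run(dag_tasks, {"2020-01-01", "2020-01-02"}, completed)
print(plan)
```

In this model nothing is re-run for 2020-01-01, because every task instance is still marked as completed. That is why task instances that should run again need to be cleared as well.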
   
   ### Backfill Request view
   
   The [backfill request 
view](https://airflow.githubapp.com/admin/backfillrequest/) is built on top of 
the Airflow CLI and is the best fit for backfilling a long period of time.
   
   **Only use this option if:**
   - The backfill period will not be worked on by the scheduler:
     - The backfill period is not within the DAG's `start_date` - `end_date` 
range; or
     - There are existing DAG runs for every DAG run in the backfill period; or
     - The DAG has `catchup=False`; or
     - The DAG is paused.
   
   Use the Backfill request UI to submit a backfill request. Backfill requests 
are served first come, first served. You can watch the position of your request 
in the queue by looking at the [Status 
tab](https://airflow.githubapp.com/admin/backfillrequest#status).
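
The conditions listed above for this option can be written down as a quick pre-check. This is a minimal sketch under stated assumptions: `DagState` and `scheduler_will_skip` are hypothetical names, the period is a list of daily dates, and "not within the range" is interpreted as no date of the period falling inside the DAG's `start_date` - `end_date` range.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set


@dataclass
class DagState:
    """Hypothetical snapshot of the DAG settings relevant to the checklist."""
    start_date: date
    end_date: Optional[date] = None
    catchup: bool = True
    paused: bool = False
    existing_run_dates: Set[date] = field(default_factory=set)


def scheduler_will_skip(dag, period):
    """Return True if at least one of the listed conditions holds for the period."""
    def in_range(day):
        return dag.start_date <= day and (dag.end_date is None or day <= dag.end_date)

    # Interpretation: no date of the period falls inside the DAG's range.
    outside_range = not any(in_range(day) for day in period)
    all_runs_exist = all(day in dag.existing_run_dates for day in period)
    return outside_range or all_runs_exist or not dag.catchup or dag.paused


period = [date(2020, 1, 1), date(2020, 1, 2)]
live_dag = DagState(start_date=date(2019, 12, 1))
paused_dag = DagState(start_date=date(2019, 12, 1), paused=True)

print(scheduler_will_skip(live_dag, period))    # False: the scheduler may pick these dates up
print(scheduler_will_skip(paused_dag, period))  # True: the DAG is paused
```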
   
    

