Hi Boris,

To answer the first question, the backfill command has a flag to mark jobs as successful without running them. Take care to align the start and end times precisely as needed. As an example, for a job that runs daily at 7am:
airflow backfill -s 2016-10-07T07 -e 2016-10-10T07 my-dag-name -m

The "-m" parameter tells Airflow to mark the runs in that range as successful without running them.

On Thu, Oct 13, 2016 at 10:46 AM, Boris Tyukin wrote:

> Hello all and thanks for such an amazing project! I have been evaluating
> Airflow and spent a few days reading about it and playing with it and I
> have a few questions that I struggle to understand.
>
> Let's say I have a simple DAG that runs once a day and it is doing a full
> reload of tables from the source database so the process is not
> incremental.
>
> Let's consider this scenario:
>
> Day 1 - OK
>
> Day 2 - airflow scheduler or server with airflow is down for some reason
> (or DAG is paused)
>
> Day 3 - still down (or DAG is paused)
>
> Day 4 - server is up and now needs to run missing jobs.
>
> How can I make airflow to run only Day 4 job and not backfill Day 2 and 3?
>
> I tried to do depends_on_past = True but it does not seem to do this trick.
>
> I also found in a roadmap doc this but seems it is not made to the release
> yet:
>
> Only Run Latest - Champion : Sid
>
> • For cases where we need to only run the latest in a series of task
> instance runs and mark the others as skipped. For example, we may have job
> to execute a DB snapshot every day. If the DAG is paused for 5 days and
> then unpaused, we don’t want to run all 5, just the latest. With this
> feature, we will provide “cron” functionality for task scheduling that is
> not related to ETL
>
> My second question, what if I have another DAG that does incremental loads
> from a source table:
>
> Day 1 - OK, loaded new/changed data for previous day
>
> Day 2 - source system is down (or DAG is paused), Airflow DagRun failed
>
> Day 3 - source system is down (or DAG is paused), Airflow DagRun failed
>
> Day 4 - source system is up, Airflow DagRun succeeded
>
> My problem (unless I am missing something), Airflow on Day 4 would use
> execution time from Day 3, so the interval for incremental load would be
> since the last run (which was Failed). My hope it would use the last
> _successful_ run so on Day 4 it would go back to Day 1. Is it possible to
> achieve this?
>
> I am aware of a manual backfill command via CLI but I am not sure I want to
> use due to all the issues and inconsistencies I've read about it.
>
> Thanks!
>

--
*Joe Napolitano* | Sr. Data Engineer
www.blueapron.com | 5 Crosby Street, New York, NY 10013
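
P.S. Since the start and end times have to line up with the schedule, it may be easier to generate the command than to type it by hand. The following is only a rough sketch: `build_mark_success_command` is a hypothetical helper, not part of Airflow, and it assumes a daily job and the same `airflow backfill -s ... -e ... <dag> -m` form used in the reply.

```python
from datetime import datetime


def build_mark_success_command(dag_id, first_missed, last_missed):
    """Build the argument list for marking a range of runs as successful.

    Hypothetical helper (not an Airflow API). It only formats the
    `airflow backfill ... -m` command from two datetime values.
    """
    if last_missed < first_missed:
        raise ValueError("last_missed must not be earlier than first_missed")

    # Hour-level precision, matching the 7am example (e.g. 2016-10-07T07).
    fmt = "%Y-%m-%dT%H"
    return [
        "airflow", "backfill",
        "-s", first_missed.strftime(fmt),
        "-e", last_missed.strftime(fmt),
        dag_id,
        "-m",  # mark the runs successful without executing them
    ]


if __name__ == "__main__":
    # Assumed dates: the same range as the example command in the reply.
    cmd = build_mark_success_command(
        "my-dag-name",
        datetime(2016, 10, 7, 7),
        datetime(2016, 10, 10, 7),
    )
    print(" ".join(cmd))
```

The list form can be passed to `subprocess.run` as-is, which avoids shell quoting problems if the DAG name contains unusual characters.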
