Yes, the first DagRun will be inferred from the min(start_date). There are a few subtleties, mostly around what happens if your start_date doesn't line up with your defined cron schedule.
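To illustrate the timing, here is a minimal sketch in plain Python. It is a simplified model, not Airflow's scheduler code, and the helper names floor_to_hour and first_trigger_time are made up for this example. It assumes a run covering a period is triggered once that period closes, as the FAQ quoted below describes.

```python
from datetime import datetime, timedelta


def floor_to_hour(ts):
    """Round a timestamp down to the top of the hour (hypothetical helper)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def first_trigger_time(start_date, interval):
    """Simplified model: the run stamped start_date covers
    [start_date, start_date + interval) and is triggered when that
    period closes (hypothetical helper, not an Airflow API)."""
    return start_date + interval


# A fixed start_date, rounded to match an hourly schedule.
start_date = datetime(2016, 6, 1, 0, 0)
interval = timedelta(hours=1)
print(first_trigger_time(start_date, interval))

# Rounding an arbitrary timestamp down to the hour.
print(floor_to_hour(datetime(2016, 6, 12, 21, 59, 30)))
```

Under this model an hourly job looks one hour late by design: the run stamped 00:00 processes the 00:00 to 01:00 window, so it can only start at 01:00.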
Here's the piece of code that schedules the first DagRun if you want more details:
https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L424

On Sun, Jun 12, 2016 at 9:59 PM, harish singh <[email protected]> wrote:

> :) I did read this before posting.
> The question I have is:
>
> Say I have 3 DAGs. Let's say I set
> 'start_date' : datetime(2015, 6, 1)
>
> Now, in my pipeline.py, suppose I dynamically query some database table
> and create DAGs.
> Let's say tomorrow I add a new DAG.
> That new DAG will get the same start_date = datetime(2015, 6, 1).
> Which means the pipeline for this new DAG will start from
> datetime(2015, 6, 1) and not from datetime.now().
>
> I am trying to understand what is a correct approach for setting this
> param so that it becomes flexible and extensible for future DAGs?
>
>
> On Sun, Jun 12, 2016 at 4:42 PM, Maxime Beauchemin <[email protected]> wrote:
>
> > From: http://pythonhosted.org/airflow/faq.html
> >
> > *What’s the deal with ``start_date``?*
> >
> > start_date is partly legacy from the pre-DagRun era, but it is still
> > relevant in many ways. When creating a new DAG, you probably want to
> > set a global start_date for your tasks using default_args. The first
> > DagRun to be created will be based on the min(start_date) for all your
> > tasks. From that point on, the scheduler creates new DagRuns based on
> > your schedule_interval and the corresponding task instances run as
> > your dependencies are met. When introducing new tasks to your DAG, you
> > need to pay special attention to start_date, and may want to
> > reactivate inactive DagRuns to get the new task to get onboarded
> > properly.
> >
> > We recommend against using dynamic values as start_date, especially
> > datetime.now() as it can be quite confusing. The task is triggered
> > once the period closes, and in theory an @hourly DAG would never get
> > to an hour after now as now() moves along.
> >
> > We also recommend using rounded start_date in relation to your
> > schedule_interval. This means an @hourly would be at 00:00
> > minutes:seconds, a @daily job at midnight, a @monthly job on the first
> > of the month. You can use any sensor or a TimeDeltaSensor to delay the
> > execution of tasks within that period. While schedule_interval does
> > allow specifying a datetime.timedelta object, we recommend using the
> > macros or cron expressions instead, as it enforces this idea of
> > rounded schedules.
> >
> > When using depends_on_past=True it’s important to pay special
> > attention to start_date as the past dependency is not enforced only on
> > the specific schedule of the start_date specified for the task. It’s
> > also important to watch DagRun activity status in time when
> > introducing new depends_on_past=True, unless you are planning on
> > running a backfill for the new task(s).
> >
> > Also important to note is that the tasks start_date, in the context of
> > a backfill CLI command, get overridden by the backfill’s command
> > start_date. This allows for a backfill on tasks that have
> > depends_on_past=True to actually start, if it wasn’t the case, the
> > backfill just wouldn’t start.
> >
> > On Sun, Jun 12, 2016 at 3:17 PM, harish singh <[email protected]> wrote:
> >
> > > These are the default args to my DAG.
> > > I am trying to run a standard hourly job (basically, at the end of
> > > this hour, process last hour's data).
> > > I noticed that my pipeline is 1 hour late.
> > >
> > > For some reason, I am messing up with my start_date I guess.
> > > What is the best practice for setting up start_date?
> > >
> > >
> > > scheduling_start_date = (datetime.utcnow()).replace(
> > >     minute=0, second=0, microsecond=0) + datetime.timedelta(minutes=15)
> > > default_schedule_interval = datetime.timedelta(minutes=60)
> > > default_args = {
> > >     'owner': 'airflow',
> > >     'depends_on_past': False,
> > >     'start_date': scheduling_start_date,
> > >     'email': ['[email protected]'],
> > >     'email_on_failure': False,
> > >     'email_on_retry': False,
> > >     'retries': 2,
> > >     'retry_delay': default_retries_delay,
> > >     'schedule_interval': default_schedule_interval
> > >     # 'queue': 'bash_queue',
> > >     # 'pool': 'backfill',
> > >     # 'priority_weight': 10,
> > >     # 'end_date': datetime(2016, 1, 1),
> > > }
