Re: airflow start_date confusion:

Maxime Beauchemin Sun, 12 Jun 2016 23:29:34 -0700

Yes, the first DagRun will be inferred from the min(start_date). There are
a few subtleties, mostly around what happens if your start_date doesn't
fall on your defined cron schedule.
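
To make the "doesn't fall on the schedule" case concrete, here is a rough
sketch in plain Python. It is not the scheduler's code: ceil_to_hour is a
hypothetical helper, and rounding up to the next boundary is only an
assumption about how an unaligned start_date could be reconciled with an
hourly schedule.

```python
from datetime import datetime, timedelta


def ceil_to_hour(moment):
    """Return the first top-of-hour boundary at or after `moment`.

    Hypothetical helper for illustration; not part of Airflow.
    """
    floored = moment.replace(minute=0, second=0, microsecond=0)
    return floored if floored == moment else floored + timedelta(hours=1)


# A start_date that does not sit on an hourly boundary...
unaligned = datetime(2015, 6, 1, 0, 15)
# ...and the nearest later boundary an hourly schedule would land on.
aligned = ceil_to_hour(unaligned)
print(aligned)  # 2015-06-01 01:00:00
```

Picking a start_date that is already on a boundary, such as
datetime(2015, 6, 1), sidesteps the question entirely.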


Here's the piece of code that schedules the first DagRun if you want
more details:
https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L424
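
The FAQ quoted below says a run is triggered once its period closes, and
that a start_date derived from now() may never get there. A toy simulation
of that argument, in plain Python rather than Airflow, might look like
this. It assumes the start_date expression is re-evaluated on every check
(as it would be each time the DAG file is parsed); period_closed and
hour_floor are hypothetical names.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(hours=1)


def period_closed(start_date, now):
    """True once the first full interval after start_date has elapsed."""
    return start_date + INTERVAL <= now


def hour_floor(moment):
    """Truncate a datetime to the top of its hour."""
    return moment.replace(minute=0, second=0, microsecond=0)


# Six successive "scheduler checks", 30 minutes apart.
clock = [datetime(2016, 6, 12, 10, 5) + timedelta(minutes=30 * i)
         for i in range(6)]

# Fixed start_date: the first period closes at 11:00 and stays closed.
fixed_start = datetime(2016, 6, 12, 10, 0)
fixed = [period_closed(fixed_start, now) for now in clock]

# Dynamic start_date: recomputed from "now" on every check, so the first
# period always ends in the future.
moving = [period_closed(hour_floor(now), now) for now in clock]

print(fixed)   # [False, False, True, True, True, True]
print(moving)  # [False, False, False, False, False, False]
```

With the fixed start_date the first period closes and a run becomes
eligible; with the moving one it never does.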

On Sun, Jun 12, 2016 at 9:59 PM, harish singh <[email protected]>
wrote:

> :) I did read this before posting.
> The question I have is:
>
> Say I have 3 DAGs. Let's say I set
> 'start_date': datetime(2015, 6, 1)
>
> Now, in my pipeline.py, say I dynamically query some database table
> and create DAGs.
> Let's say tomorrow I add a new DAG.
> That new DAG will get the same start_date = datetime(2015, 6, 1).
> Which means the pipeline for this new DAG will start from
> datetime(2015, 6, 1) and not from datetime.now().
>
> I am trying to understand what the correct approach is for setting this
> param so that it stays flexible and extensible for future DAGs.
>
>
> On Sun, Jun 12, 2016 at 4:42 PM, Maxime Beauchemin <
> [email protected]> wrote:
>
> > From: http://pythonhosted.org/airflow/faq.html
> >
> > *What’s the deal with ``start_date``?*
> >
> > start_date is partly legacy from the pre-DagRun era, but it is still
> > relevant in many ways. When creating a new DAG, you probably want to set
> > a global start_date for your tasks using default_args. The first DagRun
> > to be created will be based on the min(start_date) for all your tasks.
> > From that point on, the scheduler creates new DagRuns based on your
> > schedule_interval and the corresponding task instances run as your
> > dependencies are met. When introducing new tasks to your DAG, you need
> > to pay special attention to start_date, and may want to reactivate
> > inactive DagRuns to get the new task onboarded properly.
> >
> > We recommend against using dynamic values as start_date, especially
> > datetime.now() as it can be quite confusing. The task is triggered once
> > the period closes, and in theory an @hourly DAG would never get to an
> > hour after now as now() moves along.
> >
> > We also recommend using rounded start_date in relation to your
> > schedule_interval. This means an @hourly would be at 00:00
> > minutes:seconds, a @daily job at midnight, a @monthly job on the first
> > of the month. You can use any sensor or a TimeDeltaSensor to delay the
> > execution of tasks within that period. While schedule_interval does
> > allow specifying a datetime.timedelta object, we recommend using the
> > macros or cron expressions instead, as it enforces this idea of rounded
> > schedules.
> >
> > When using depends_on_past=True it’s important to pay special attention
> > to start_date as the past dependency is not enforced only on the
> > specific schedule of the start_date specified for the task. It’s also
> > important to watch DagRun activity status in time when introducing new
> > depends_on_past=True, unless you are planning on running a backfill for
> > the new task(s).
> >
> > Also important to note is that the task’s start_date, in the context of
> > a backfill CLI command, gets overridden by the backfill command’s
> > start_date. This allows for a backfill on tasks that have
> > depends_on_past=True to actually start; if that weren’t the case, the
> > backfill just wouldn’t start.
> >
> > On Sun, Jun 12, 2016 at 3:17 PM, harish singh <[email protected]>
> > wrote:
> >
> > > These are the default args to my DAG.
> > > I am trying to run a standard hourly job (basically, at the end of
> > > this hour, process the last hour's data).
> > > I noticed that my pipeline is 1 hour late.
> > >
> > > For some reason, I am messing up with my start_date I guess.
> > > What is the best practice for setting up start_date?
> > >
> > >
> > > from datetime import datetime, timedelta
> > >
> > > scheduling_start_date = (
> > >     datetime.utcnow().replace(minute=0, second=0, microsecond=0)
> > >     + timedelta(minutes=15)
> > > )
> > > default_schedule_interval = timedelta(minutes=60)
> > > default_args = {
> > >     'owner': 'airflow',
> > >     'depends_on_past': False,
> > >     'start_date': scheduling_start_date,
> > >     'email': ['[email protected]'],
> > >     'email_on_failure': False,
> > >     'email_on_retry': False,
> > >     'retries': 2,
> > >     'retry_delay': default_retries_delay,
> > >     'schedule_interval': default_schedule_interval,
> > >
> > >     # 'queue': 'bash_queue',
> > >     # 'pool': 'backfill',
> > >     # 'priority_weight': 10,
> > >     # 'end_date': datetime(2016, 1, 1),
> > > }
> > >
> >
>
