For this kind of use case someone should just use a TimeDeltaSensor as a
root task in their DAG if 1:10a is the "trigger". I'm suspecting that
person could use a better trigger than a TimeSensor though. What happens at
1:10a? If it's "the time at which the database scrapes are usually done",
that user should just have a more explicit sensor validating that the
scrapes a done, perhaps look for the data itself or some trigger file /
partition, ...

Max

On Tue, Apr 26, 2016 at 10:30 AM, Siddharth Anand <
[email protected]> wrote:

> I have the following concerns, which may be best illustrated with
> examples:
> Start date : 2016-01-01Schedule Interval : */20 * * * *  (run every 20
> mins)
> Expectation : Runs on
>    - 20160101T000000
>
>    - 20160101T002000
>
>    - 20160101T004000
>
>    - 20160101T014000
>
>    - ...
>
> Start date : 2016-01-01Schedule Interval : 10 1 * * *  (run at 1:10 every
> day)
> Expectation : Runs on
>    - 20160101T011000
>
>    - 20160102T011000
>
>    - 20160103T011000
>
>    - 20160104T011000
>    - ...
>
> In my mind, the second case is invalid. I suspect he wants to run a daily
> ETL job which is aligned on day boundaries but he wants data to settle
> before kicking off the job at 1:10a every day due to late arriving data.
> Though we can kick the run off at 1:10a every day, the execution date for
> each run will be 20160103T011000 every day. The problem is that the ETL
> actually needs to cover the day boundaries from midnight-to-midnight each
> day, not from 1:10a to 1:10a. So, what happens is that the dag runs and the
> task instance execution dates will reflect the scheduling interval, not the
> dates covered by the ETL process. The developer, will need to do his own
> date trimming inside his DAG to get to the day boundary (i.e.
> 20160101T011000 --> 20160101T000000) and his history will reflect
> scheduling instead of data coverage period.
> Today, because we require alignment, the instance dates and dag runs
> reflect both the data coverage period and the schedule. Hence, the
> execution date can be passed as window arguments to downstream systems.
> Does this make sense or am I concerned about a non-issue? Essentially, this
> change will be break existing functionality without people being aware.
> -s
>
>     On Tuesday, April 26, 2016 9:42 AM, Jeremiah Lowin <[email protected]>
> wrote:
>
>
>  On further thought, I understand the issue you're describing where this
> could lead to out-of-order runs. In fact, Bolke alerted me to the
> possibility earlier but I didn't make the connection! That feels like a
> separate issue -- to guarantee that tasks are executed in order (and more
> importantly that their database entries are created). I think the
> depends_on_past issue is related but separate -- though clearly needs to be
> fleshed out for all cases :)
>
> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <[email protected]> wrote:
>
> > I'm afraid I disagree -- though we may be talking about two different
> > issues. This issue deals specifically with how to identify the "past" TI
> > when evaluating "depends_on_past", and shouldn't be impacted by shifting
> > start_date, transparently or not.
> >
> > Here are three valid examples of depends_on_past DAGs that would fail to
> > run with the current setup:
> >
> > 1. A DAG with no schedule that is only run manually or via ad-hoc
> > backfill. Without a schedule_interval, depends_on_past will always fail
> > (since it looks back one schedule_interval).
> >
> > 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
> > say a scheduled run succeeds and then an off-schedule run fails. When the
> > next scheduled run starts, it shouldn't proceed because the most recent
> > task failed -- but it will look back one schedule_interval, jumping OVER
> > the most recent run, and decide it's ok to proceed.
> >
> > 3. A DAG with a schedule that is paused for a while. This DAG could be
> > running perfectly fine, but if it is paused for a while and then resumed,
> > its depends_on_past tasks will look back one schedule_interval and see
> > nothing, and therefore refuse to run.
> >
> > So my proposal is simply that the depends_on_past logic looks back at the
> > most recent task as opposed to rigidly assuming there is a task one
> > schedule_interval ago. For a regularly scheduled DAG, this will result in
> > absolutely no behavior change. However it will robustly support a much
> > wider variety of cases like the ones I listed above.
> >
> > J
> >
> >
> > On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
> > [email protected]> wrote:
> >
> >> >>> "The clear fix seems to be to have depends_on_past check the last TI
> >> that
> >> ran, regardless of whether it ran `schedule_interval` ago. That's in
> line
> >> with the intent of the flag. I will submit a fix."
> >>
> >> I don't think so. This would lead to skipping runs, which
> depends_on_past
> >> is used as a guarantee to run every TI, sequentially.
> >>
> >> Absolute scheduling (cron expressions) is much better than relative
> >> scheduling (origin + interval). Though it's easy to make relative
> >> scheduling behave in an absolute way. You just have to use a rounded
> >> start_date to your schedule_interval, and not move things around.
> Dynamic
> >> start_dates have always been a problem and should probably not be
> >> supported, though there's no way for us to tell.
> >>
> >> Changing the schedule interval or the "origin time" is a bit tricky and
> >> should be documented.
> >>
> >> If people use depend_on_past=True and change the origin or the interval,
> >> they basically redefine what "past" actually means and will require to
> >> "mark success" or defining a new "start_date" as a way to say "please
> >> disregard depend_on_past for this date"
> >>
> >> For those who haven't fully digested "What's the deal with start_dates",
> >> please take the time to read it:
> >> http://pythonhosted.org/airflow/faq.html
> >>
> >> Also see this part of the docs:
> >>
> >> ​
> >>
> >> Max
> >>
> >> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <[email protected]>
> wrote:
> >>
> >>> Bolke, Sid and I had a brief conversation to discuss some of the
> >>> implications of https://github.com/airbnb/airflow/issues/1427
> >>>
> >>> There are two large points that need to be addressed:
> >>>
> >>> 1. this particular issue arises because of an alignment issue between
> >>> start_date and schedule_interval. This can only happen with cron-based
> >>> schedule_intervals that describe absolute points in time (like “1am”)
> as
> >>> opposed to time deltas (like “every hour”). Ironically, I once reported
> >>> this same issue myself (#959). In the past (and in the docs) we have
> >>> simply
> >>> said that users must make sure the two params agree. We discussed the
> >>> possibility of a DAG validation method to raise an error if the
> >>> start_date
> >>> and schedule_interval don’t align, but Bolke made the point (and I
> >>> agreed)
> >>> that in these cases, start_date is sort of like telling the scheduler
> to
> >>> “start paying attention” as opposed to “this is my first execution
> date”.
> >>> In #1427, the scheduler was being asked to start paying attention on
> >>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00.
> However,
> >>> it
> >>> was scheduling a first run at midnight and a second run at 1:10.
> >>>
> >>> Regardless of whether we choose to validate/warn/error, Bolke is going
> to
> >>> change the scheduling logic so that the cron-based interval takes
> >>> precedence over a start date. Specifically, the first date on or after
> >>> the
> >>> start_date that complies with the schedule_interval becomes the first
> >>> execution date.
> >>>
> >>> 2. Issue #1 led to a second issue: depends_on_past checks for a
> >>> successful
> >>> TI at `execution_date - schedule_interval`. This is fragile, since it
> is
> >>> very possible for the previous TI to have run at any time in the past,
> >>> not
> >>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
> >>> runs, and also if a DAG was paused for a while. Less commonly, it
> happens
> >>> with the situation described in point #1, where the first scheduled run
> >>> is
> >>> off-schedule (the midnight run followed by the daily 1:10am runs).
> >>>
> >>> The clear fix seems to be to have depends_on_past check the last TI
> that
> >>> ran, regardless of whether it ran `schedule_interval` ago. That's in
> line
> >>> with the intent of the flag. I will submit a fix.
> >>>
> >>> -J
> >>>
> >>
> >>
>
>
>

Reply via email to