>>> "The clear fix seems to be to have depends_on_past check the last TI
that
ran, regardless of whether it ran `schedule_interval` ago. That's in line
with the intent of the flag. I will submit a fix."

I don't think so. This would lead to skipping runs, which depends_on_past
is used as a guarantee to run every TI, sequentially.

Absolute scheduling (cron expressions) is much better than relative
scheduling (origin + interval). Though it's easy to make relative
scheduling behave in an absolute way. You just have to use a rounded
start_date to your schedule_interval, and not move things around. Dynamic
start_dates have always been a problem and should probably not be
supported, though there's no way for us to tell.

Changing the schedule interval or the "origin time" is a bit tricky and
should be documented.

If people use depend_on_past=True and change the origin or the interval,
they basically redefine what "past" actually means and will require to
"mark success" or defining a new "start_date" as a way to say "please
disregard depend_on_past for this date"

For those who haven't fully digested "What's the deal with start_dates",
please take the time to read it:
http://pythonhosted.org/airflow/faq.html

Also see this part of the docs:

​

Max

On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <[email protected]> wrote:

> Bolke, Sid and I had a brief conversation to discuss some of the
> implications of https://github.com/airbnb/airflow/issues/1427
>
> There are two large points that need to be addressed:
>
> 1. this particular issue arises because of an alignment issue between
> start_date and schedule_interval. This can only happen with cron-based
> schedule_intervals that describe absolute points in time (like “1am”) as
> opposed to time deltas (like “every hour”). Ironically, I once reported
> this same issue myself (#959). In the past (and in the docs) we have simply
> said that users must make sure the two params agree. We discussed the
> possibility of a DAG validation method to raise an error if the start_date
> and schedule_interval don’t align, but Bolke made the point (and I agreed)
> that in these cases, start_date is sort of like telling the scheduler to
> “start paying attention” as opposed to “this is my first execution date”.
> In #1427, the scheduler was being asked to start paying attention on
> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However, it
> was scheduling a first run at midnight and a second run at 1:10.
>
> Regardless of whether we choose to validate/warn/error, Bolke is going to
> change the scheduling logic so that the cron-based interval takes
> precedence over a start date. Specifically, the first date on or after the
> start_date that complies with the schedule_interval becomes the first
> execution date.
>
> 2. Issue #1 led to a second issue: depends_on_past checks for a successful
> TI at `execution_date - schedule_interval`. This is fragile, since it is
> very possible for the previous TI to have run at any time in the past, not
> just one schedule_interval ago. This can happen easily with ad-hoc DAG
> runs, and also if a DAG was paused for a while. Less commonly, it happens
> with the situation described in point #1, where the first scheduled run is
> off-schedule (the midnight run followed by the daily 1:10am runs).
>
> The clear fix seems to be to have depends_on_past check the last TI that
> ran, regardless of whether it ran `schedule_interval` ago. That's in line
> with the intent of the flag. I will submit a fix.
>
> -J
>

Reply via email to