I'm afraid I disagree -- though we may be talking about two different
issues. This issue deals specifically with how to identify the "past" TI
when evaluating "depends_on_past", and shouldn't be impacted by shifting
start_date, transparently or not.

Here are three valid examples of depends_on_past DAGs that would fail to
run with the current setup:

1. A DAG with no schedule that is only run manually or via ad-hoc backfill.
Without a schedule_interval, depends_on_past will always fail (since it
looks back one schedule_interval).

2. A DAG with a schedule, but that is sometimes run off-schedule. Let's say
a scheduled run succeeds and then an off-schedule run fails. When the next
scheduled run starts, it shouldn't proceed because the most recent task
failed -- but it will look back one schedule_interval, jumping OVER the
most recent run, and decide it's ok to proceed.

3. A DAG with a schedule that is paused for a while. This DAG could be
running perfectly fine, but if it is paused for a while and then resumed,
its depends_on_past tasks will look back one schedule_interval and see
nothing, and therefore refuse to run.

So my proposal is simply that the depends_on_past logic looks back at the
most recent task as opposed to rigidly assuming there is a task one
schedule_interval ago. For a regularly scheduled DAG, this will result in
absolutely no behavior change. However it will robustly support a much
wider variety of cases like the ones I listed above.

J

On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
[email protected]> wrote:

> >>> "The clear fix seems to be to have depends_on_past check the last TI
> that
> ran, regardless of whether it ran `schedule_interval` ago. That's in line
> with the intent of the flag. I will submit a fix."
>
> I don't think so. This would lead to skipping runs, which depends_on_past
> is used as a guarantee to run every TI, sequentially.
>
> Absolute scheduling (cron expressions) is much better than relative
> scheduling (origin + interval). Though it's easy to make relative
> scheduling behave in an absolute way. You just have to use a rounded
> start_date to your schedule_interval, and not move things around. Dynamic
> start_dates have always been a problem and should probably not be
> supported, though there's no way for us to tell.
>
> Changing the schedule interval or the "origin time" is a bit tricky and
> should be documented.
>
> If people use depend_on_past=True and change the origin or the interval,
> they basically redefine what "past" actually means and will require to
> "mark success" or defining a new "start_date" as a way to say "please
> disregard depend_on_past for this date"
>
> For those who haven't fully digested "What's the deal with start_dates",
> please take the time to read it:
> http://pythonhosted.org/airflow/faq.html
>
> Also see this part of the docs:
>
> ​
>
> Max
>
> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <[email protected]> wrote:
>
>> Bolke, Sid and I had a brief conversation to discuss some of the
>> implications of https://github.com/airbnb/airflow/issues/1427
>>
>> There are two large points that need to be addressed:
>>
>> 1. this particular issue arises because of an alignment issue between
>> start_date and schedule_interval. This can only happen with cron-based
>> schedule_intervals that describe absolute points in time (like “1am”) as
>> opposed to time deltas (like “every hour”). Ironically, I once reported
>> this same issue myself (#959). In the past (and in the docs) we have
>> simply
>> said that users must make sure the two params agree. We discussed the
>> possibility of a DAG validation method to raise an error if the start_date
>> and schedule_interval don’t align, but Bolke made the point (and I agreed)
>> that in these cases, start_date is sort of like telling the scheduler to
>> “start paying attention” as opposed to “this is my first execution date”.
>> In #1427, the scheduler was being asked to start paying attention on
>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However,
>> it
>> was scheduling a first run at midnight and a second run at 1:10.
>>
>> Regardless of whether we choose to validate/warn/error, Bolke is going to
>> change the scheduling logic so that the cron-based interval takes
>> precedence over a start date. Specifically, the first date on or after the
>> start_date that complies with the schedule_interval becomes the first
>> execution date.
>>
>> 2. Issue #1 led to a second issue: depends_on_past checks for a successful
>> TI at `execution_date - schedule_interval`. This is fragile, since it is
>> very possible for the previous TI to have run at any time in the past, not
>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
>> runs, and also if a DAG was paused for a while. Less commonly, it happens
>> with the situation described in point #1, where the first scheduled run is
>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>
>> The clear fix seems to be to have depends_on_past check the last TI that
>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>> with the intent of the flag. I will submit a fix.
>>
>> -J
>>
>
>

Reply via email to