I have the following concerns, which may be best illustrated with examples:
Start date : 2016-01-01
Schedule Interval : */20 * * * * (run every 20 mins)
Expectation : Runs on
- 20160101T000000
- 20160101T002000
- 20160101T004000
- 20160101T010000
- ...
Start date : 2016-01-01
Schedule Interval : 10 1 * * * (run at 1:10 every day)
Expectation : Runs on
- 20160101T011000
- 20160102T011000
- 20160103T011000
- 20160104T011000
- ...
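As a rough sketch of the two expectations (plain Python, standard library only; `fixed_step_runs` and `fmt` are names made up for this illustration and are not Airflow APIs, and the cron expressions are hard-coded as fixed steps rather than parsed):

```python
from datetime import datetime, timedelta
from itertools import islice


def fixed_step_runs(first_run, step):
    """Yield run times starting at first_run, spaced step apart."""
    current = first_run
    while True:
        yield current
        current += step


def fmt(ts):
    # Same compact timestamp style as the lists above.
    return ts.strftime("%Y%m%dT%H%M%S")


start_date = datetime(2016, 1, 1)

# Case 1: */20 * * * * fires every 20 minutes and lines up with the start date.
every_20_min = [
    fmt(t) for t in islice(fixed_step_runs(start_date, timedelta(minutes=20)), 4)
]

# Case 2: 10 1 * * * fires at 01:10 daily, 70 minutes past the start date.
daily_0110 = [
    fmt(t)
    for t in islice(
        fixed_step_runs(start_date.replace(hour=1, minute=10), timedelta(days=1)), 4
    )
]

print(every_20_min)
print(daily_0110)
```

In the first case the run times sit on the same grid as the start date; in the second every run carries the 01:10 offset.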
In my mind, the second case is invalid. I suspect the user wants to run a daily
ETL job that is aligned on day boundaries, but wants the data to settle before
kicking off the job at 1:10a every day because of late-arriving data. Though we
can kick the run off at 1:10a every day, the execution date for each run will
then carry the 1:10 offset (e.g. 20160103T011000). The problem is that the ETL
actually needs to cover the day boundaries from midnight to midnight each day,
not from 1:10a to 1:10a. So the dag runs and the task instance execution dates
will reflect the scheduling interval, not the dates covered by the ETL process.
The developer will need to do his own date trimming inside his DAG to get to
the day boundary (i.e. 20160101T011000 --> 20160101T000000), and his history
will reflect scheduling instead of the data coverage period.
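The trimming I have in mind would look roughly like this. It is only a sketch: `day_window` is a hypothetical helper, and I am assuming the ETL covers the calendar day that contains the execution date.

```python
from datetime import datetime, timedelta


def day_window(execution_date):
    """Return the midnight-to-midnight window containing execution_date."""
    # Drop the time-of-day part to land on the day boundary.
    window_start = execution_date.replace(hour=0, minute=0, second=0, microsecond=0)
    return window_start, window_start + timedelta(days=1)


# An execution date carrying the 01:10 offset from the schedule.
start, end = day_window(datetime(2016, 1, 1, 1, 10))
print(start.isoformat(), end.isoformat())
```

Every DAG author with an offset schedule would have to carry a helper like this, and the recorded execution dates would still show 01:10 rather than the window that was processed.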
Today, because we require alignment, the instance dates and dag runs reflect
both the data coverage period and the schedule. Hence, the execution date can
be passed as window arguments to downstream systems. Does this make sense or am
I concerned about a non-issue? Essentially, this change will break existing
functionality without people being aware.
-s
On Tuesday, April 26, 2016 9:42 AM, Jeremiah Lowin <[email protected]>
wrote:
On further thought, I understand the issue you're describing where this
could lead to out-of-order runs. In fact, Bolke alerted me to the
possibility earlier but I didn't make the connection! That feels like a
separate issue -- to guarantee that tasks are executed in order (and more
importantly that their database entries are created). I think the
depends_on_past issue is related but separate -- though clearly needs to be
fleshed out for all cases :)
On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <[email protected]> wrote:
> I'm afraid I disagree -- though we may be talking about two different
> issues. This issue deals specifically with how to identify the "past" TI
> when evaluating "depends_on_past", and shouldn't be impacted by shifting
> start_date, transparently or not.
>
> Here are three valid examples of depends_on_past DAGs that would fail to
> run with the current setup:
>
> 1. A DAG with no schedule that is only run manually or via ad-hoc
> backfill. Without a schedule_interval, depends_on_past will always fail
> (since it looks back one schedule_interval).
>
> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
> say a scheduled run succeeds and then an off-schedule run fails. When the
> next scheduled run starts, it shouldn't proceed because the most recent
> task failed -- but it will look back one schedule_interval, jumping OVER
> the most recent run, and decide it's ok to proceed.
>
> 3. A DAG with a schedule that is paused for a while. This DAG could be
> running perfectly fine, but if it is paused for a while and then resumed,
> its depends_on_past tasks will look back one schedule_interval and see
> nothing, and therefore refuse to run.
>
> So my proposal is simply that the depends_on_past logic looks back at the
> most recent task as opposed to rigidly assuming there is a task one
> schedule_interval ago. For a regularly scheduled DAG, this will result in
> absolutely no behavior change. However it will robustly support a much
> wider variety of cases like the ones I listed above.
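
To make the difference between the two lookups concrete, here is a minimal sketch. It is not the Airflow implementation: `history` is a plain dict standing in for the task instance table, and both function names are invented for this illustration.

```python
from datetime import datetime, timedelta


def previous_by_interval(execution_date, schedule_interval, history):
    """Lookup as described above: exactly one schedule_interval back."""
    if schedule_interval is None:
        return None  # no interval to look back by, so nothing is found
    return history.get(execution_date - schedule_interval)


def previous_by_most_recent(execution_date, history):
    """Proposed lookup: the latest task instance before execution_date."""
    earlier = [d for d in history if d < execution_date]
    return history[max(earlier)] if earlier else None


# Example 2 from the list: a scheduled success, then an off-schedule failure.
history = {
    datetime(2016, 4, 24): "success",
    datetime(2016, 4, 24, 15, 30): "failed",
}
next_run = datetime(2016, 4, 25)
daily = timedelta(days=1)

print(previous_by_interval(next_run, daily, history))
print(previous_by_most_recent(next_run, history))
```

The interval-based lookup jumps over the failed off-schedule run and finds the earlier success, while the most-recent lookup finds the failure. With no schedule_interval, or after a long pause, the interval-based lookup finds nothing at all.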
>
> J
>
>
> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
> [email protected]> wrote:
>
>> >>> "The clear fix seems to be to have depends_on_past check the last TI
>> >>> that ran, regardless of whether it ran `schedule_interval` ago. That's
>> >>> in line with the intent of the flag. I will submit a fix."
>>
>> I don't think so. This would lead to skipping runs, whereas depends_on_past
>> is used as a guarantee that every TI runs, sequentially.
>>
>> Absolute scheduling (cron expressions) is much better than relative
>> scheduling (origin + interval), though it's easy to make relative
>> scheduling behave in an absolute way: you just have to use a start_date
>> rounded to your schedule_interval, and not move things around. Dynamic
>> start_dates have always been a problem and should probably not be
>> supported, though there's no way for us to tell.
>>
>> Changing the schedule interval or the "origin time" is a bit tricky and
>> should be documented.
>>
>> If people use depends_on_past=True and change the origin or the interval,
>> they basically redefine what "past" actually means, and will need to
>> "mark success" or define a new "start_date" as a way to say "please
>> disregard depends_on_past for this date".
>>
>> For those who haven't fully digested "What's the deal with start_dates",
>> please take the time to read it:
>> http://pythonhosted.org/airflow/faq.html
>>
>> Also see this part of the docs:
>>
>>
>>
>> Max
>>
>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <[email protected]> wrote:
>>
>>> Bolke, Sid and I had a brief conversation to discuss some of the
>>> implications of https://github.com/airbnb/airflow/issues/1427
>>>
>>> There are two large points that need to be addressed:
>>>
>>> 1. this particular issue arises because of an alignment issue between
>>> start_date and schedule_interval. This can only happen with cron-based
>>> schedule_intervals that describe absolute points in time (like “1am”) as
>>> opposed to time deltas (like “every hour”). Ironically, I once reported
>>> this same issue myself (#959). In the past (and in the docs) we have
>>> simply said that users must make sure the two params agree. We discussed
>>> the possibility of a DAG validation method to raise an error if the
>>> start_date and schedule_interval don’t align, but Bolke made the point
>>> (and I agreed) that in these cases, start_date is sort of like telling
>>> the scheduler to “start paying attention” as opposed to “this is my first
>>> execution date”. In #1427, the scheduler was being asked to start paying
>>> attention on 4/24/16 00:00:00 but not to do anything until 4/24/16
>>> 01:10:00. However, it was scheduling a first run at midnight and a second
>>> run at 1:10.
>>>
>>> Regardless of whether we choose to validate/warn/error, Bolke is going to
>>> change the scheduling logic so that the cron-based interval takes
>>> precedence over a start date. Specifically, the first date on or after
>>> the start_date that complies with the schedule_interval becomes the first
>>> execution date.
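
For the daily case in #1427, the "first date on or after the start_date" rule could be sketched as below. This is an illustration only: `first_daily_execution` is a hypothetical helper that handles just a fixed daily hour and minute, not general cron expressions, and it is not the scheduler's actual code.

```python
from datetime import datetime, timedelta


def first_daily_execution(start_date, hour, minute):
    """First time on or after start_date that falls on hour:minute."""
    candidate = start_date.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate < start_date:
        # Today's slot has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


# start_date of 4/24/16 00:00:00 with a "10 1 * * *" schedule.
print(first_daily_execution(datetime(2016, 4, 24), 1, 10))
```

Under this rule the midnight run from #1427 is never created; the first execution date is 4/24/16 01:10:00.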
>>>
>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
>>> successful TI at `execution_date - schedule_interval`. This is fragile,
>>> since it is very possible for the previous TI to have run at any time in
>>> the past, not just one schedule_interval ago. This can happen easily with
>>> ad-hoc DAG runs, and also if a DAG was paused for a while. Less commonly,
>>> it happens with the situation described in point #1, where the first
>>> scheduled run is off-schedule (the midnight run followed by the daily
>>> 1:10am runs).
>>>
>>> The clear fix seems to be to have depends_on_past check the last TI that
>>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>>> with the intent of the flag. I will submit a fix.
>>>
>>> -J
>>>
>>
>>