I have also given it some thought and some coding. I hope we agree that from a 
user perspective
there currently is an issue. The need to align the start_date with the interval 
is counter intuitive
and leads to a lot of questions and issue creation, although it is in the 
documentation. If we are
able to fix this with none or little consequences for current setups that 
should be preferred, I think.
The dependency explainer is really great work, but it doesn’t address the core 
issue.

I do recognize the issue mentioned by Max and I agree with Jeremiah that it is 
a separate issue,
however if we are to tackle the above it needs to be done in one go.

If you consider a DAG a description of cohesion between work items (in OOP java 
terms
a class), then a DagRun is the instantiation of a DAG in time (in OOP java 
terms an instance). 
Tasks are then the description of a work item and a TaskInstance the 
instantiation of the Task in time.

In my opinion issues pop up due to the current paradigm of considering the 
TaskInstance
the smallest unit of work and asking it to maintain its own state in relation 
to other TaskInstances
in a DagRun and in a previous DagRun of which it has no (real) perception. 
Tasks are instantiated
by a cartesian product with the dates of DagRun instead of the DagRuns itself. 

The very loose coupling between DagRuns and TaskInstances can be improved while 
maintaining
flexibility to run tasks without a DagRun. This would help with a couple of 
things:

1. start_date can be used as a ‘execution_date’ or a point in time when to 
start looking
2. a new interval for a dag will maintain depends_on_past
3. paused dags do not give trouble
4. tasks will be executed in order 
5. the ignore_first_depend_on_past could be removed as a task will now know if 
it is really the first

In PR-1431 a lot of this work has been done by:

1. Adding a “previous” field to a DagRun allowing it to connect to its 
predecessor
2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the 
DagRun if needed
3. Using start_date + interval as the first run date unless start_date is on 
the interval then start_date is the first run date

The downside of the PR would be that TaskInstances without a DagRun will not 
honor
depend_on_past, so the backfill would need to be adjusted to create a DagRun
and properly connect it to a previous DagRun if that exists. Not too much of a 
big 
deal, I think - will work on that a bit as well in the PR.

So this should give at least an idea where things could head to. Any thoughts?

Cheers
Bolke

> Op 26 apr. 2016, om 18:42 heeft Jeremiah Lowin <[email protected]> het 
> volgende geschreven:
> 
> On further thought, I understand the issue you're describing where this
> could lead to out-of-order runs. In fact, Bolke alerted me to the
> possibility earlier but I didn't make the connection! That feels like a
> separate issue -- to guarantee that tasks are executed in order (and more
> importantly that their database entries are created). I think the
> depends_on_past issue is related but separate -- though clearly needs to be
> fleshed out for all cases :)
> 
> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <[email protected]> wrote:
> 
>> I'm afraid I disagree -- though we may be talking about two different
>> issues. This issue deals specifically with how to identify the "past" TI
>> when evaluating "depends_on_past", and shouldn't be impacted by shifting
>> start_date, transparently or not.
>> 
>> Here are three valid examples of depends_on_past DAGs that would fail to
>> run with the current setup:
>> 
>> 1. A DAG with no schedule that is only run manually or via ad-hoc
>> backfill. Without a schedule_interval, depends_on_past will always fail
>> (since it looks back one schedule_interval).
>> 
>> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
>> say a scheduled run succeeds and then an off-schedule run fails. When the
>> next scheduled run starts, it shouldn't proceed because the most recent
>> task failed -- but it will look back one schedule_interval, jumping OVER
>> the most recent run, and decide it's ok to proceed.
>> 
>> 3. A DAG with a schedule that is paused for a while. This DAG could be
>> running perfectly fine, but if it is paused for a while and then resumed,
>> its depends_on_past tasks will look back one schedule_interval and see
>> nothing, and therefore refuse to run.
>> 
>> So my proposal is simply that the depends_on_past logic looks back at the
>> most recent task as opposed to rigidly assuming there is a task one
>> schedule_interval ago. For a regularly scheduled DAG, this will result in
>> absolutely no behavior change. However it will robustly support a much
>> wider variety of cases like the ones I listed above.
>> 
>> J
>> 
>> 
>> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <
>> [email protected]> wrote:
>> 
>>>>>> "The clear fix seems to be to have depends_on_past check the last TI
>>> that
>>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>>> with the intent of the flag. I will submit a fix."
>>> 
>>> I don't think so. This would lead to skipping runs, which depends_on_past
>>> is used as a guarantee to run every TI, sequentially.
>>> 
>>> Absolute scheduling (cron expressions) is much better than relative
>>> scheduling (origin + interval). Though it's easy to make relative
>>> scheduling behave in an absolute way. You just have to use a rounded
>>> start_date to your schedule_interval, and not move things around. Dynamic
>>> start_dates have always been a problem and should probably not be
>>> supported, though there's no way for us to tell.
>>> 
>>> Changing the schedule interval or the "origin time" is a bit tricky and
>>> should be documented.
>>> 
>>> If people use depend_on_past=True and change the origin or the interval,
>>> they basically redefine what "past" actually means and will require to
>>> "mark success" or defining a new "start_date" as a way to say "please
>>> disregard depend_on_past for this date"
>>> 
>>> For those who haven't fully digested "What's the deal with start_dates",
>>> please take the time to read it:
>>> http://pythonhosted.org/airflow/faq.html
>>> 
>>> Also see this part of the docs:
>>> 
>>> ​
>>> 
>>> Max
>>> 
>>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <[email protected]> wrote:
>>> 
>>>> Bolke, Sid and I had a brief conversation to discuss some of the
>>>> implications of https://github.com/airbnb/airflow/issues/1427
>>>> 
>>>> There are two large points that need to be addressed:
>>>> 
>>>> 1. this particular issue arises because of an alignment issue between
>>>> start_date and schedule_interval. This can only happen with cron-based
>>>> schedule_intervals that describe absolute points in time (like “1am”) as
>>>> opposed to time deltas (like “every hour”). Ironically, I once reported
>>>> this same issue myself (#959). In the past (and in the docs) we have
>>>> simply
>>>> said that users must make sure the two params agree. We discussed the
>>>> possibility of a DAG validation method to raise an error if the
>>>> start_date
>>>> and schedule_interval don’t align, but Bolke made the point (and I
>>>> agreed)
>>>> that in these cases, start_date is sort of like telling the scheduler to
>>>> “start paying attention” as opposed to “this is my first execution date”.
>>>> In #1427, the scheduler was being asked to start paying attention on
>>>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However,
>>>> it
>>>> was scheduling a first run at midnight and a second run at 1:10.
>>>> 
>>>> Regardless of whether we choose to validate/warn/error, Bolke is going to
>>>> change the scheduling logic so that the cron-based interval takes
>>>> precedence over a start date. Specifically, the first date on or after
>>>> the
>>>> start_date that complies with the schedule_interval becomes the first
>>>> execution date.
>>>> 
>>>> 2. Issue #1 led to a second issue: depends_on_past checks for a
>>>> successful
>>>> TI at `execution_date - schedule_interval`. This is fragile, since it is
>>>> very possible for the previous TI to have run at any time in the past,
>>>> not
>>>> just one schedule_interval ago. This can happen easily with ad-hoc DAG
>>>> runs, and also if a DAG was paused for a while. Less commonly, it happens
>>>> with the situation described in point #1, where the first scheduled run
>>>> is
>>>> off-schedule (the midnight run followed by the daily 1:10am runs).
>>>> 
>>>> The clear fix seems to be to have depends_on_past check the last TI that
>>>> ran, regardless of whether it ran `schedule_interval` ago. That's in line
>>>> with the intent of the flag. I will submit a fix.
>>>> 
>>>> -J
>>>> 
>>> 
>>> 

Reply via email to