On further thought, I understand the issue you're describing where this could lead to out-of-order runs. In fact, Bolke alerted me to the possibility earlier but I didn't make the connection! That feels like a separate issue -- to guarantee that tasks are executed in order (and more importantly that their database entries are created). I think the depends_on_past issue is related but separate -- though clearly needs to be fleshed out for all cases :)
On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <[email protected]> wrote: > I'm afraid I disagree -- though we may be talking about two different > issues. This issue deals specifically with how to identify the "past" TI > when evaluating "depends_on_past", and shouldn't be impacted by shifting > start_date, transparently or not. > > Here are three valid examples of depends_on_past DAGs that would fail to > run with the current setup: > > 1. A DAG with no schedule that is only run manually or via ad-hoc > backfill. Without a schedule_interval, depends_on_past will always fail > (since it looks back one schedule_interval). > > 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's > say a scheduled run succeeds and then an off-schedule run fails. When the > next scheduled run starts, it shouldn't proceed because the most recent > task failed -- but it will look back one schedule_interval, jumping OVER > the most recent run, and decide it's ok to proceed. > > 3. A DAG with a schedule that is paused for a while. This DAG could be > running perfectly fine, but if it is paused for a while and then resumed, > its depends_on_past tasks will look back one schedule_interval and see > nothing, and therefore refuse to run. > > So my proposal is simply that the depends_on_past logic looks back at the > most recent task as opposed to rigidly assuming there is a task one > schedule_interval ago. For a regularly scheduled DAG, this will result in > absolutely no behavior change. However it will robustly support a much > wider variety of cases like the ones I listed above. > > J > > > On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin < > [email protected]> wrote: > >> >>> "The clear fix seems to be to have depends_on_past check the last TI >> that >> ran, regardless of whether it ran `schedule_interval` ago. That's in line >> with the intent of the flag. I will submit a fix." >> >> I don't think so. This would lead to skipping runs, which depends_on_past >> is used as a guarantee to run every TI, sequentially. >> >> Absolute scheduling (cron expressions) is much better than relative >> scheduling (origin + interval). Though it's easy to make relative >> scheduling behave in an absolute way. You just have to use a rounded >> start_date to your schedule_interval, and not move things around. Dynamic >> start_dates have always been a problem and should probably not be >> supported, though there's no way for us to tell. >> >> Changing the schedule interval or the "origin time" is a bit tricky and >> should be documented. >> >> If people use depend_on_past=True and change the origin or the interval, >> they basically redefine what "past" actually means and will require to >> "mark success" or defining a new "start_date" as a way to say "please >> disregard depend_on_past for this date" >> >> For those who haven't fully digested "What's the deal with start_dates", >> please take the time to read it: >> http://pythonhosted.org/airflow/faq.html >> >> Also see this part of the docs: >> >> >> >> Max >> >> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <[email protected]> wrote: >> >>> Bolke, Sid and I had a brief conversation to discuss some of the >>> implications of https://github.com/airbnb/airflow/issues/1427 >>> >>> There are two large points that need to be addressed: >>> >>> 1. this particular issue arises because of an alignment issue between >>> start_date and schedule_interval. This can only happen with cron-based >>> schedule_intervals that describe absolute points in time (like “1am”) as >>> opposed to time deltas (like “every hour”). Ironically, I once reported >>> this same issue myself (#959). In the past (and in the docs) we have >>> simply >>> said that users must make sure the two params agree. We discussed the >>> possibility of a DAG validation method to raise an error if the >>> start_date >>> and schedule_interval don’t align, but Bolke made the point (and I >>> agreed) >>> that in these cases, start_date is sort of like telling the scheduler to >>> “start paying attention” as opposed to “this is my first execution date”. >>> In #1427, the scheduler was being asked to start paying attention on >>> 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However, >>> it >>> was scheduling a first run at midnight and a second run at 1:10. >>> >>> Regardless of whether we choose to validate/warn/error, Bolke is going to >>> change the scheduling logic so that the cron-based interval takes >>> precedence over a start date. Specifically, the first date on or after >>> the >>> start_date that complies with the schedule_interval becomes the first >>> execution date. >>> >>> 2. Issue #1 led to a second issue: depends_on_past checks for a >>> successful >>> TI at `execution_date - schedule_interval`. This is fragile, since it is >>> very possible for the previous TI to have run at any time in the past, >>> not >>> just one schedule_interval ago. This can happen easily with ad-hoc DAG >>> runs, and also if a DAG was paused for a while. Less commonly, it happens >>> with the situation described in point #1, where the first scheduled run >>> is >>> off-schedule (the midnight run followed by the daily 1:10am runs). >>> >>> The clear fix seems to be to have depends_on_past check the last TI that >>> ran, regardless of whether it ran `schedule_interval` ago. That's in line >>> with the intent of the flag. I will submit a fix. >>> >>> -J >>> >> >>
