For this kind of use case, someone should just use a TimeDeltaSensor as a root task in their DAG if 1:10a is the "trigger". I suspect that person could use a better trigger than a TimeSensor, though. What happens at 1:10a? If it's "the time at which the database scrapes are usually done", that user should have a more explicit sensor validating that the scrapes are done, perhaps looking for the data itself or some trigger file / partition, ...

Max

On Tue, Apr 26, 2016 at 10:30 AM, Siddharth Anand <[email protected]> wrote:

> I have the following concerns, which may be best illustrated with
> examples:
>
> Start date : 2016-01-01
> Schedule Interval : */20 * * * * (run every 20 mins)
> Expectation : Runs on
> - 20160101T000000
> - 20160101T002000
> - 20160101T004000
> - 20160101T014000
> - ...
>
> Start date : 2016-01-01
> Schedule Interval : 10 1 * * * (run at 1:10 every day)
> Expectation : Runs on
> - 20160101T011000
> - 20160102T011000
> - 20160103T011000
> - 20160104T011000
> - ...
>
> In my mind, the second case is invalid. I suspect he wants to run a daily
> ETL job which is aligned on day boundaries but he wants data to settle
> before kicking off the job at 1:10a every day due to late arriving data.
> Though we can kick the run off at 1:10a every day, the execution date for
> each run will be 20160103T011000 every day. The problem is that the ETL
> actually needs to cover the day boundaries from midnight-to-midnight each
> day, not from 1:10a to 1:10a. So, what happens is that the dag runs and the
> task instance execution dates will reflect the scheduling interval, not the
> dates covered by the ETL process. The developer will need to do his own
> date trimming inside his DAG to get to the day boundary (i.e.
> 20160101T011000 --> 20160101T000000) and his history will reflect
> scheduling instead of data coverage period.
>
> Today, because we require alignment, the instance dates and dag runs
> reflect both the data coverage period and the schedule. Hence, the
> execution date can be passed as window arguments to downstream systems.
>
> Does this make sense or am I concerned about a non-issue? Essentially, this
> change will break existing functionality without people being aware.
>
> -s
>
> On Tuesday, April 26, 2016 9:42 AM, Jeremiah Lowin <[email protected]> wrote:
>
> On further thought, I understand the issue you're describing where this
> could lead to out-of-order runs.
> In fact, Bolke alerted me to the possibility earlier but I didn't make
> the connection! That feels like a separate issue -- to guarantee that
> tasks are executed in order (and more importantly that their database
> entries are created). I think the depends_on_past issue is related but
> separate -- though clearly needs to be fleshed out for all cases :)
>
> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <[email protected]> wrote:
>
> > I'm afraid I disagree -- though we may be talking about two different
> > issues. This issue deals specifically with how to identify the "past" TI
> > when evaluating "depends_on_past", and shouldn't be impacted by shifting
> > start_date, transparently or not.
> >
> > Here are three valid examples of depends_on_past DAGs that would fail to
> > run with the current setup:
> >
> > 1. A DAG with no schedule that is only run manually or via ad-hoc
> > backfill. Without a schedule_interval, depends_on_past will always fail
> > (since it looks back one schedule_interval).
> >
> > 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's
> > say a scheduled run succeeds and then an off-schedule run fails. When the
> > next scheduled run starts, it shouldn't proceed because the most recent
> > task failed -- but it will look back one schedule_interval, jumping OVER
> > the most recent run, and decide it's ok to proceed.
> >
> > 3. A DAG with a schedule that is paused for a while. This DAG could be
> > running perfectly fine, but if it is paused for a while and then resumed,
> > its depends_on_past tasks will look back one schedule_interval and see
> > nothing, and therefore refuse to run.
> >
> > So my proposal is simply that the depends_on_past logic looks back at the
> > most recent task as opposed to rigidly assuming there is a task one
> > schedule_interval ago. For a regularly scheduled DAG, this will result in
> > absolutely no behavior change.
> > However it will robustly support a much wider variety of cases like
> > the ones I listed above.
> >
> > J
> >
> > On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin
> > <[email protected]> wrote:
> >
> >> >>> "The clear fix seems to be to have depends_on_past check the last
> >> >>> TI that ran, regardless of whether it ran `schedule_interval` ago.
> >> >>> That's in line with the intent of the flag. I will submit a fix."
> >>
> >> I don't think so. This would lead to skipping runs, whereas
> >> depends_on_past is used as a guarantee that every TI runs,
> >> sequentially.
> >>
> >> Absolute scheduling (cron expressions) is much better than relative
> >> scheduling (origin + interval). Though it's easy to make relative
> >> scheduling behave in an absolute way. You just have to use a
> >> start_date rounded to your schedule_interval, and not move things
> >> around. Dynamic start_dates have always been a problem and should
> >> probably not be supported, though there's no way for us to tell.
> >>
> >> Changing the schedule interval or the "origin time" is a bit tricky
> >> and should be documented.
> >>
> >> If people use depends_on_past=True and change the origin or the
> >> interval, they basically redefine what "past" actually means and will
> >> need to "mark success" or define a new "start_date" as a way to say
> >> "please disregard depends_on_past for this date".
> >>
> >> For those who haven't fully digested "What's the deal with
> >> start_dates", please take the time to read it:
> >> http://pythonhosted.org/airflow/faq.html
> >>
> >> Also see this part of the docs:
> >>
> >> Max
> >>
> >> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <[email protected]> wrote:
> >>
> >>> Bolke, Sid and I had a brief conversation to discuss some of the
> >>> implications of https://github.com/airbnb/airflow/issues/1427
> >>>
> >>> There are two large points that need to be addressed:
> >>>
> >>> 1. This particular issue arises because of an alignment issue between
> >>> start_date and schedule_interval. This can only happen with cron-based
> >>> schedule_intervals that describe absolute points in time (like “1am”)
> >>> as opposed to time deltas (like “every hour”). Ironically, I once
> >>> reported this same issue myself (#959). In the past (and in the docs)
> >>> we have simply said that users must make sure the two params agree. We
> >>> discussed the possibility of a DAG validation method to raise an error
> >>> if the start_date and schedule_interval don’t align, but Bolke made the
> >>> point (and I agreed) that in these cases, start_date is sort of like
> >>> telling the scheduler to “start paying attention” as opposed to “this
> >>> is my first execution date”. In #1427, the scheduler was being asked to
> >>> start paying attention on 4/24/16 00:00:00 but not to do anything until
> >>> 4/24/16 01:10:00. However, it was scheduling a first run at midnight
> >>> and a second run at 1:10.
> >>>
> >>> Regardless of whether we choose to validate/warn/error, Bolke is going
> >>> to change the scheduling logic so that the cron-based interval takes
> >>> precedence over a start date. Specifically, the first date on or after
> >>> the start_date that complies with the schedule_interval becomes the
> >>> first execution date.
> >>>
> >>> 2. Issue #1 led to a second issue: depends_on_past checks for a
> >>> successful TI at `execution_date - schedule_interval`. This is fragile,
> >>> since it is very possible for the previous TI to have run at any time
> >>> in the past, not just one schedule_interval ago. This can happen easily
> >>> with ad-hoc DAG runs, and also if a DAG was paused for a while. Less
> >>> commonly, it happens with the situation described in point #1, where
> >>> the first scheduled run is off-schedule (the midnight run followed by
> >>> the daily 1:10am runs).
> >>>
> >>> The clear fix seems to be to have depends_on_past check the last TI
> >>> that ran, regardless of whether it ran `schedule_interval` ago. That's
> >>> in line with the intent of the flag. I will submit a fix.
> >>>
> >>> -J
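
For readers following the depends_on_past discussion in this thread, the two lookups being compared can be sketched in plain Python. This is only an illustration under stated assumptions, not Airflow's implementation: the function names, the dict-of-states representation, and the sample histories are all hypothetical.

```python
from datetime import datetime, timedelta


def previous_by_interval(states, execution_date, schedule_interval):
    """Strict lookup (hypothetical helper): state of the TI exactly one
    schedule_interval before execution_date, or None if there is none."""
    return states.get(execution_date - schedule_interval)


def previous_most_recent(states, execution_date):
    """Proposed lookup (hypothetical helper): state of the latest TI
    strictly before execution_date, or None if there is none."""
    earlier = [d for d in states if d < execution_date]
    return states[max(earlier)] if earlier else None


day = timedelta(days=1)

# Made-up history: a daily DAG paused after Jan 2 and resumed on Jan 10.
paused = {
    datetime(2016, 1, 1): "success",
    datetime(2016, 1, 2): "success",
}
resumed = datetime(2016, 1, 10)
print(previous_by_interval(paused, resumed, day))  # None
print(previous_most_recent(paused, resumed))       # success

# Made-up history: a scheduled run succeeds, then an off-schedule run fails.
off_schedule = {
    datetime(2016, 1, 1, 0, 0): "success",
    datetime(2016, 1, 1, 12, 0): "failed",
}
next_run = datetime(2016, 1, 2)
print(previous_by_interval(off_schedule, next_run, day))  # success
print(previous_most_recent(off_schedule, next_run))       # failed
```

In the paused case the strict lookup finds nothing one interval back, while the most-recent lookup finds the last successful run. In the off-schedule case the strict lookup jumps over the failed run. Neither sketch addresses the concern raised in the thread about runs being skipped or created out of order.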

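
The alignment points raised in the thread (the first execution date being the first date on or after start_date that matches the schedule, and trimming an execution date back to a midnight-to-midnight window) can be sketched the same way. This is a simplified stand-in that handles only a fixed daily time such as 01:10 rather than general cron expressions; the helper names are hypothetical and not part of Airflow.

```python
from datetime import datetime, time, timedelta


def first_daily_run(start_date, at):
    """Hypothetical helper: first datetime on or after start_date whose
    time of day equals `at` (a stand-in for a daily cron schedule)."""
    candidate = datetime.combine(start_date.date(), at)
    if candidate < start_date:
        candidate += timedelta(days=1)
    return candidate


def day_window(execution_date):
    """Hypothetical helper: midnight-to-midnight window containing
    execution_date, i.e. the date trimming described in the thread."""
    start = execution_date.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, start + timedelta(days=1)


# start_date at midnight, schedule "10 1 * * *": first run is 01:10 that day.
print(first_daily_run(datetime(2016, 4, 24), time(1, 10)))
# 2016-04-24 01:10:00

# An execution date of 20160101T011000 trimmed to its day boundaries.
print(day_window(datetime(2016, 1, 1, 1, 10)))
# (datetime.datetime(2016, 1, 1, 0, 0), datetime.datetime(2016, 1, 2, 0, 0))
```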