I just realized that an extra “benefit” (depending on how you look at it) of this implementation is that many of the deadlock issues are gone.
* If there is no previous run, it will satisfy depends_on_past, so test_backfill_depends_on_past, which checks for a deadlock (instead of maybe really checking for the dependency), will fail as there is no deadlock anymore.
* test_cli_backfill_depends_on_past and test_dagrun_deadlock also fail, as there is no deadlock anymore.

Why does this not happen anymore? If a run detects it is the first run, it will start. This removes the requirement for ignore_first_depends_on_past (was this present for a real use case, or as a workaround for the scheduler?).

- Bolke

> Op 28 apr. 2016, om 09:08 heeft Bolke de Bruin <[email protected]> het volgende geschreven:
>
> No, I don’t think that covers it (although I get slightly confused by your comments “depends_on_past_schedule” and “depends_on_past_run”; it seems too complex from the user’s perspective at first glance). The DagRun.previous handling is used to connect related DagRuns to their predecessors in order to make depends_on_past work with a moving interval or an automatically aligned start_date.
>
> Imagine the following.
>
> Use Case 1
> Dag1 has an assigned start_date of "2016-04-24 00:00:00" with a cron schedule of “10 1 * * *”, i.e. run every day at 1:10am. depends_on_past is true.
>
> The current scheduler kicks off a run for “2016-04-24 00:00:00” and then stops, as the previous interval of “2016-04-24 01:10:00” is “2016-04-23 01:10:00”. This cannot be found, since it never took place, and thus the whole thing grinds to a halt and the TaskInstances refuse to run.
>
> What the user expects here is that the first run is “2016-04-24 01:10:00”, i.e. start_date + interval, unless the start_date is on the interval, in which case start_date is the first interval. This is what I address with start_date normalization in the PR. However, the second issue then kicks in, as the “previous” run still cannot be found.
>
> Use Case 2
> Dag2 has an assigned start_date of "2016-04-24 01:10:00" with a cron schedule of “10 1 * * *”. depends_on_past is true.
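The start_date normalization described in Use Case 1 — take the first cron tick on or after start_date as the first execution date — can be sketched for the daily “10 1 * * *” case with plain datetime arithmetic. This is a minimal illustration, not the PR's code (Airflow resolves general cron expressions via croniter); `normalize_start_date` is a hypothetical helper name:

```python
from datetime import datetime, time, timedelta

def normalize_start_date(start_date: datetime) -> datetime:
    """First execution date for a daily '10 1 * * *' schedule:
    the first 01:10 tick on or after start_date. If start_date is
    already on the interval, it is itself the first run date."""
    tick = datetime.combine(start_date.date(), time(1, 10))
    if tick < start_date:
        tick += timedelta(days=1)  # start_date is already past today's 01:10
    return tick

# Use Case 1: start_date 2016-04-24 00:00:00 -> first run 2016-04-24 01:10:00
print(normalize_start_date(datetime(2016, 4, 24, 0, 0)))   # 2016-04-24 01:10:00
# start_date exactly on the interval is itself the first run date
print(normalize_start_date(datetime(2016, 4, 24, 1, 10)))  # 2016-04-24 01:10:00
```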
> The scheduler happily goes through a few runs, but then the dag is updated and the schedule adjusted. Because the previous interval cannot be found by the TaskInstance, tasks get stuck again, requiring a manual override.
>
> What the user expects here is that the scheduler is smart enough to figure out that we are still running the same dag, look up the previous run for that dag, and make sure dependencies are met with that previous dagrun in mind.
>
> I don’t think those two use cases are edge cases, considering the number of questions we get on these subjects.
>
> To resolve the remaining issues (aside from start_date normalization), I first made DagRun aware of its predecessors. Then I strengthened the relationship between TaskInstances and DagRuns slightly, by optionally including a dagrun_id in the TaskInstance. Now a TaskInstance can look up its predecessors in the previous DagRun and know for sure that it is either the first run or has a predecessor somewhere in time, instead of guessing.
>
> What I am unsure of is what to consider the unit of work: 1) is a TaskInstance dependent on its past predecessor, ignoring the outcome of the DagRun (the previous TaskInstance can be successful while the previous DagRun as a whole fails)? 2) Is it dependent on the outcome of the DagRun? 3) Can it be both? In cases 1 and 3 my logic needs to be updated slightly, but that should not be too big a deal. However, I have difficulty imagining why you would want to do that.
>
> - Bolke
>
>> Op 27 apr. 2016, om 18:17 heeft Maxime Beauchemin <[email protected]> het volgende geschreven:
>>
>> Added a few comments on the PR:
>>
>> ------------------------------------------
>>
>> So if I understand right, the previous DagRun.id handling is to account for DAGs that have a mixed workload of scheduled and triggered DagRuns, right?
>>
>> Do we have real-world use cases for those?
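The predecessor design Bolke describes — a DagRun that knows its previous run, and a TaskInstance that can reach its DagRun — might be sketched like this. These are hypothetical minimal classes for illustration, not the actual PR-1431 models:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DagRun:
    execution_date: datetime
    state: str = "running"
    previous: Optional["DagRun"] = None  # link to predecessor; None for the first run

@dataclass
class TaskInstance:
    task_id: str
    dag_run: DagRun

    def previous_ti(self) -> Optional["TaskInstance"]:
        """Look up the same task in the previous DagRun. Returning None
        means this is provably the first run -- no interval guessing."""
        prev_run = self.dag_run.previous
        if prev_run is None:
            return None
        return TaskInstance(self.task_id, prev_run)

first = DagRun(datetime(2016, 4, 24, 1, 10))
second = DagRun(datetime(2016, 4, 25, 1, 10), previous=first)
ti = TaskInstance("load", second)
print(ti.previous_ti().dag_run.execution_date)    # 2016-04-24 01:10:00
print(TaskInstance("load", first).previous_ti())  # None -> first run
```

Note that the predecessor lookup follows the explicit link rather than computing `execution_date - interval`, which is what makes it robust to schedule changes and non-aligned start dates.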
>> I always thought a DAG would either be scheduled or externally triggered, not a mix of both, but we never made that a real restriction, so I'm sure some people mix both.
>>
>> If we do want to support both, the notion of depends_on_past becomes mushy: people may expect or desire either a depends_on_past_schedule or a depends_on_past_run.
>>
>> Since it's a pretty edge case (a combo of mixed scheduling AND depends_on_past=True) and it's unclear what the desired behavior might be, I'd vote to do something simple in terms of logic and document it clearly. It could be something like "if the DAG is scheduled, then we use the latest DagRun that is >= the previous schedule; if not scheduled, then we take the previous DagRun, whatever it is at that time". This would not require handling previous_dag_run_id.
>>
>> On Wed, Apr 27, 2016 at 6:02 AM, Bolke de Bruin <[email protected]> wrote:
>>
>>> I have also given it some thought and some coding. I hope we agree that from a user perspective there currently is an issue. The need to align the start_date with the interval is counter-intuitive and leads to a lot of questions and issue creation, although it is in the documentation. If we are able to fix this with no or few consequences for current setups, that should be preferred, I think. The dependency explainer is really great work, but it doesn’t address the core issue.
>>>
>>> I do recognize the issue mentioned by Max, and I agree with Jeremiah that it is a separate issue; however, if we are to tackle the above, it needs to be done in one go.
>>>
>>> If you consider a DAG a description of the cohesion between work items (in OOP Java terms, a class), then a DagRun is the instantiation of a DAG in time (in OOP Java terms, an instance). Tasks are then the description of a work item, and a TaskInstance the instantiation of a Task in time.
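Max's "something simple" rule above could be sketched as follows; `previous_run` is a hypothetical helper, with the run history modeled as a plain list of execution dates and the schedule as a fixed timedelta:

```python
from datetime import datetime, timedelta
from typing import List, Optional

def previous_run(runs: List[datetime], execution_date: datetime,
                 schedule_interval: Optional[timedelta]) -> Optional[datetime]:
    """Sketch of Max's proposed rule: for a scheduled DAG, take the
    latest run at or after the previous schedule tick (and before this
    run); for an unscheduled DAG, just take the most recent earlier run."""
    earlier = sorted(d for d in runs if d < execution_date)
    if not earlier:
        return None  # no previous run at all
    if schedule_interval is not None:
        prev_tick = execution_date - schedule_interval
        candidates = [d for d in earlier if d >= prev_tick]
        return candidates[-1] if candidates else None
    return earlier[-1]

# A scheduled daily DAG with one ad-hoc run squeezed in:
runs = [datetime(2016, 4, 24, 1, 10), datetime(2016, 4, 24, 15, 0)]
print(previous_run(runs, datetime(2016, 4, 25, 1, 10), timedelta(days=1)))
# picks the ad-hoc 15:00 run: the latest run that is >= the previous tick
```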
>>> In my opinion, issues pop up due to the current paradigm of considering the TaskInstance the smallest unit of work and asking it to maintain its own state in relation to other TaskInstances in a DagRun, and in a previous DagRun of which it has no (real) perception. Tasks are instantiated by a cartesian product with the dates of DagRuns instead of the DagRuns themselves.
>>>
>>> The very loose coupling between DagRuns and TaskInstances can be improved while maintaining the flexibility to run tasks without a DagRun. This would help with a couple of things:
>>>
>>> 1. start_date can be used as an ‘execution_date’ or as a point in time from which to start looking
>>> 2. a new interval for a dag will maintain depends_on_past
>>> 3. paused dags do not give trouble
>>> 4. tasks will be executed in order
>>> 5. ignore_first_depends_on_past could be removed, as a task will now know if it is really the first
>>>
>>> In PR-1431 a lot of this work has been done by:
>>>
>>> 1. Adding a “previous” field to a DagRun, allowing it to connect to its predecessor
>>> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the DagRun if needed
>>> 3. Using start_date + interval as the first run date, unless start_date is on the interval, in which case start_date is the first run date
>>>
>>> The downside of the PR is that TaskInstances without a DagRun will not honor depends_on_past, so the backfill would need to be adjusted to create a DagRun and properly connect it to a previous DagRun if one exists. Not too big a deal, I think - I will work on that a bit as well in the PR.
>>>
>>> So this should give at least an idea of where things could head. Any thoughts?
>>>
>>> Cheers
>>> Bolke
>>>
>>>> Op 26 apr.
>>>> 2016, om 18:42 heeft Jeremiah Lowin <[email protected]> het volgende geschreven:
>>>>
>>>> On further thought, I understand the issue you're describing where this could lead to out-of-order runs. In fact, Bolke alerted me to the possibility earlier but I didn't make the connection! That feels like a separate issue -- to guarantee that tasks are executed in order (and more importantly that their database entries are created). I think the depends_on_past issue is related but separate -- though clearly it needs to be fleshed out for all cases :)
>>>>
>>>> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <[email protected]> wrote:
>>>>
>>>>> I'm afraid I disagree -- though we may be talking about two different issues. This issue deals specifically with how to identify the "past" TI when evaluating "depends_on_past", and shouldn't be impacted by shifting start_date, transparently or not.
>>>>>
>>>>> Here are three valid examples of depends_on_past DAGs that would fail to run with the current setup:
>>>>>
>>>>> 1. A DAG with no schedule that is only run manually or via ad-hoc backfill. Without a schedule_interval, depends_on_past will always fail (since it looks back one schedule_interval).
>>>>>
>>>>> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's say a scheduled run succeeds and then an off-schedule run fails. When the next scheduled run starts, it shouldn't proceed because the most recent task failed -- but it will look back one schedule_interval, jumping OVER the most recent run, and decide it's ok to proceed.
>>>>>
>>>>> 3. A DAG with a schedule that is paused for a while. This DAG could be running perfectly fine, but if it is paused for a while and then resumed, its depends_on_past tasks will look back one schedule_interval and see nothing, and therefore refuse to run.
>>>>> So my proposal is simply that the depends_on_past logic look back at the most recent task, as opposed to rigidly assuming there is a task one schedule_interval ago. For a regularly scheduled DAG, this will result in absolutely no behavior change. However, it will robustly support a much wider variety of cases, like the ones I listed above.
>>>>>
>>>>> J
>>>>>
>>>>> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <[email protected]> wrote:
>>>>>
>>>>>> "The clear fix seems to be to have depends_on_past check the last TI that ran, regardless of whether it ran `schedule_interval` ago. That's in line with the intent of the flag. I will submit a fix."
>>>>>>
>>>>>> I don't think so. This would lead to skipping runs, whereas depends_on_past is used as a guarantee to run every TI sequentially.
>>>>>>
>>>>>> Absolute scheduling (cron expressions) is much better than relative scheduling (origin + interval), though it's easy to make relative scheduling behave in an absolute way: you just have to use a start_date rounded to your schedule_interval and not move things around. Dynamic start_dates have always been a problem and should probably not be supported, though there's no way for us to tell.
>>>>>>
>>>>>> Changing the schedule interval or the "origin time" is a bit tricky and should be documented.
>>>>>> If people use depends_on_past=True and change the origin or the interval, they basically redefine what "past" actually means, and it will require "mark success" or defining a new "start_date" as a way to say "please disregard depends_on_past for this date".
>>>>>>
>>>>>> For those who haven't fully digested "What's the deal with start_dates", please take the time to read it:
>>>>>> http://pythonhosted.org/airflow/faq.html
>>>>>>
>>>>>> Also see this part of the docs:
>>>>>>
>>>>>> Max
>>>>>>
>>>>>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <[email protected]> wrote:
>>>>>>
>>>>>>> Bolke, Sid and I had a brief conversation to discuss some of the implications of https://github.com/airbnb/airflow/issues/1427
>>>>>>>
>>>>>>> There are two large points that need to be addressed:
>>>>>>>
>>>>>>> 1. This particular issue arises because of an alignment issue between start_date and schedule_interval. It can only happen with cron-based schedule_intervals that describe absolute points in time (like “1am”), as opposed to time deltas (like “every hour”). Ironically, I once reported this same issue myself (#959). In the past (and in the docs) we have simply said that users must make sure the two params agree. We discussed the possibility of a DAG validation method that raises an error if the start_date and schedule_interval don’t align, but Bolke made the point (and I agreed) that in these cases start_date is sort of like telling the scheduler to “start paying attention”, as opposed to “this is my first execution date”. In #1427, the scheduler was being asked to start paying attention on 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However, it was scheduling a first run at midnight and a second run at 1:10.
>>>>>>> Regardless of whether we choose to validate/warn/error, Bolke is going to change the scheduling logic so that the cron-based interval takes precedence over a start date. Specifically, the first date on or after the start_date that complies with the schedule_interval becomes the first execution date.
>>>>>>>
>>>>>>> 2. Issue #1 led to a second issue: depends_on_past checks for a successful TI at `execution_date - schedule_interval`. This is fragile, since it is very possible for the previous TI to have run at any time in the past, not just one schedule_interval ago. This can happen easily with ad-hoc DAG runs, and also if a DAG was paused for a while. Less commonly, it happens with the situation described in point #1, where the first scheduled run is off-schedule (the midnight run followed by the daily 1:10am runs).
>>>>>>>
>>>>>>> The clear fix seems to be to have depends_on_past check the last TI that ran, regardless of whether it ran `schedule_interval` ago. That's in line with the intent of the flag. I will submit a fix.
>>>>>>>
>>>>>>> -J
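The contrast in the thread — the fragile interval-based lookup versus the proposed most-recent-run lookup (which also lets a true first run satisfy depends_on_past, per the note at the top) — can be sketched with two hypothetical helpers, modeling TI history as a dict mapping execution_date to state:

```python
from datetime import datetime, timedelta

def fragile_check(history, execution_date, interval):
    """Current behavior: require a successful TI exactly one
    schedule_interval ago -- breaks after pauses or ad-hoc runs."""
    return history.get(execution_date - interval) == "success"

def proposed_check(history, execution_date):
    """Proposed fix: consult the most recent TI that actually ran,
    whenever that was; no earlier TI means a provable first run."""
    earlier = [d for d in history if d < execution_date]
    if not earlier:
        return True  # first run: nothing to depend on
    return history[max(earlier)] == "success"

# Scenario 3 from the thread: a daily DAG paused after Apr 20, resumed Apr 25.
history = {datetime(2016, 4, 20, 1, 10): "success"}
exec_date = datetime(2016, 4, 25, 1, 10)
print(fragile_check(history, exec_date, timedelta(days=1)))  # False -> stuck
print(proposed_check(history, exec_date))                    # True  -> runs
```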
