All, I have created AIRFLOW-20 to continue the discussion on the PR/Code.
https://issues.apache.org/jira/browse/AIRFLOW-20

Bolke

> On 28 Apr 2016, at 20:08, Bolke de Bruin <[email protected]> wrote:
>
> Ok, PR-1431 now passes all tests for mysql. Postgresql/sqlite have some alembic upgrade issues that I don't know yet how to solve. Basically, because a unique constraint is added and dropped in the same transaction, it does not work. So any ideas are welcome.
>
> I have to retract my earlier comment about the deadlock issues no longer being there. I got confused, as a depends_on_past block is reported as a deadlock by the backfill. I did update the tests to make sure there is a past to depend on (and thus fail if required).
>
> On the first issue you mention, Jeremiah, a new state for DagRuns, WAITING_FOR_PREVIOUS, would be the way forward. This would allow you to walk the timeline. With the PR, detecting this kind of "deadlock" is already very simple. For backfills, as a backfill job is currently immediate, the failure should also be immediate, I think: not deadlocked, but reported as "cannot schedule, as previous run not satisfied", which is easy now, and we can even report which root run has not finished yet with WAITING_FOR_PREVIOUS.
>
> On the second issue: with the PR this is no longer an issue. If the backfill run is the first one, it will know so, and just run. Hence this parameter is superfluous in my opinion. I also consider the parameter a workaround currently.
>
> Bolke
>
>> On 28 Apr 2016, at 14:11, Jeremiah Lowin <[email protected]> wrote:
>>
>> The depends_on_past deadlock is the worst kind of deadlock (to catch in code). Basically, if you're a DAG and you have "active" (runnable) tasks but none of them have their dependencies met, that's a deadlock. But if your tasks depend_on_past, we can't draw that conclusion, since even though the dependencies are unmet NOW, they might be met in the future.
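A minimal sketch of the heuristic described above, with illustrative names rather than Airflow's actual API: a run counts as deadlocked only when it has unfinished tasks, nothing is runnable now, and none of the blocked tasks is merely waiting on depends_on_past (those may still become runnable once the past succeeds).

```python
from dataclasses import dataclass

@dataclass
class Task:
    # Hypothetical stand-in for a task instance; not Airflow's real model.
    task_id: str
    depends_on_past: bool = False

def is_deadlocked(unfinished, runnable):
    """A run is deadlocked when tasks remain, none can run right now,
    and none of them is simply waiting on a past run."""
    if not unfinished or runnable:
        return False
    # If any blocked task depends_on_past, its deps may be met later,
    # so we cannot call this a deadlock yet.
    return not any(t.depends_on_past for t in unfinished)
```

With this shape, the false positives described in the thread disappear: a depends_on_past task with an unfinished predecessor is reported as waiting, not deadlocked.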
>> However, this will still be an issue even with your change, Bolke, because let's say you have 10 DagRuns active -- 9 of them will report that they can't start, but we don't want to fail them, because they are just waiting for past runs. So I think the deadlock unit tests just need to change from looking at the next run to looking 2+ runs out.
>>
>> ignore_first_depends_on_past exists to allow arbitrary backfills of depends_on_past tasks. In other words, if you want to backfill from A to B, we want to make sure you can run the first one. It used to be that when you ran a backfill, it overrode the start date of the task to trick the task into running on the first backfilled date. Unfortunately that change didn't survive into nested subprocesses, so it was a fragile hack, and ignore_first_depends_on_past replaced it. It's probably not ideal, but it's more robust.
>>
>> On Thu, Apr 28, 2016 at 3:53 AM Bolke de Bruin <[email protected]> wrote:
>>
>>> I just realized that an extra "benefit" (depending how you look at it) of this implementation is that many of the deadlock issues are gone.
>>>
>>> * If there is no previous run, it will satisfy depends_on_past, so test_backfill_depends_on_past, which checks for a deadlock (instead of maybe really checking for the dependency), will fail, as there is no deadlock anymore.
>>> * test_cli_backfill_depends_on_past and test_dagrun_deadlock also fail, as there is no deadlock anymore.
>>>
>>> Why does this not happen?
>>>
>>> If a run detects it is the first run, it will start. This removes the requirement for ignore_first_depends_on_past (was this present for a real use case, or as a workaround for the scheduler?).
>>>
>>> - Bolke
>>>
>>>> On 28 Apr 2016, at 09:08, Bolke de Bruin <[email protected]> wrote:
>>>>
>>>> No, I don't think that covers it (although I get slightly confused by your comments "depends_on_past_schedule" and "depends_on_past_run"; it seems too complex from the user's perspective at first glance). The DagRun.previous handling is used to connect related DagRuns, in order to make depends_on_past work with a moving interval or an automatically aligned start_date.
>>>>
>>>> Imagine the following.
>>>>
>>>> Use Case 1
>>>> Dag1 has an assigned start_date of "2016-04-24 00:00:00" with a cron schedule of "10 1 * * *", i.e. run every day at 1:10am. depends_on_past is true.
>>>>
>>>> The current scheduler kicks off a run for "2016-04-24 00:00:00" and then stops, as the previous interval of "2016-04-24 01:10:00" is "2016-04-23 01:10:00". This cannot be found, as it has not taken place, and thus the whole thing grinds to a halt and the TaskInstances refuse to run.
>>>>
>>>> What the user expects here is that the first run is "2016-04-24 01:10:00", i.e. start_date + interval, unless the start_date is on the interval, in which case the start_date is the first interval. This is what I address by start_date normalization in the PR. However, the second issue then kicks in, as the "previous" run still cannot be found.
>>>>
>>>> Use Case 2
>>>> Dag2 has an assigned start_date of "2016-04-24 01:10:00" with a cron schedule of "10 1 * * *". depends_on_past is true.
>>>>
>>>> The scheduler happily goes through a few runs, but then the dag is updated and the schedule adjusted. Because the previous interval cannot be found by the TaskInstance (emphasis), tasks get stuck again, requiring a manual override.
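The start_date normalization rule from Use Case 1 can be sketched as follows. A real implementation would presumably use a cron library such as croniter; here the "10 1 * * *" schedule (daily at 01:10) is hard-coded for illustration, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

def first_execution_date(start_date: datetime) -> datetime:
    """Normalize start_date against a daily-at-01:10 schedule:
    if start_date falls exactly on the schedule, it is the first run;
    otherwise the first run is the next tick after start_date."""
    candidate = start_date.replace(hour=1, minute=10, second=0, microsecond=0)
    if candidate < start_date:
        # The day's 01:10 tick has already passed; roll to the next day.
        candidate += timedelta(days=1)
    return candidate
```

For Dag1 above, a start_date of 2016-04-24 00:00:00 normalizes to a first run at 2016-04-24 01:10:00, instead of the spurious midnight run the current scheduler produces.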
>>>> What the user expects here is that the scheduler is smart enough to figure out that we are still running the same dag, and that it needs to look up the previous run for that dag and make sure dependencies are met with that previous dagrun in mind.
>>>>
>>>> I don't think those two use cases are edge cases, considering the amount of questions we get on these subjects.
>>>>
>>>> To resolve the remaining issues (aside from start_date normalization), I first made DagRun aware of its predecessors. Then I strengthened the relationship between TaskInstances and DagRuns slightly, by optionally including a dagrun_id in the TaskInstance. Now a TaskInstance can look up its predecessors in the previous DagRun and know for sure that it is either the first run or has a predecessor somewhere in time, instead of guessing.
>>>>
>>>> What I am unsure of is what to consider the unit of work: 1) is a TaskInstance dependent on its past predecessor, ignoring the outcome of the DagRun (the previous TaskInstance can be successful while the previous DagRun as a whole fails)? 2) is it dependent on the outcome of the DagRun? 3) can it be both? In case of 1 and 3 my logic needs to be updated slightly, but that should not be too big a deal. However, I have difficulty imagining why you would want to do that.
>>>>
>>>> - Bolke
>>>>
>>>>> On 27 Apr 2016, at 18:17, Maxime Beauchemin <[email protected]> wrote:
>>>>>
>>>>> Added a few comments on the PR:
>>>>>
>>>>> ------------------------------------------
>>>>>
>>>>> So if I understand right, the previous DagRun.id handling is to account for DAGs that have a mixed workload of scheduled and triggered DagRuns, right?
>>>>>
>>>>> Do we have real world use cases for those?
>>>>> I always thought a DAG would either be scheduled or externally triggered, not a mix of both, but we never made that a real restriction, so I'm sure some people mix both.
>>>>>
>>>>> If it is the case that we want to support both, the notion of depends_on_past becomes mushy, where people may expect or desire either a depends_on_past_schedule or a depends_on_past_run.
>>>>>
>>>>> Since it's a pretty edge case (a combo of mixed scheduling AND depends_on_past=True) and unclear what the desired behavior might be, I'd vote to do something simple in terms of logic and document it clearly. It could be something like: "if the DAG is scheduled, then we use the latest DagRun that is >= the previous schedule; if not scheduled, then we take the previous DagRun, whatever it is at that time." This would not require handling the previous_dag_run_id.
>>>>>
>>>>> On Wed, Apr 27, 2016 at 6:02 AM, Bolke de Bruin <[email protected]> wrote:
>>>>>
>>>>>> I have also given it some thought and some coding. I hope we agree that from a user perspective there currently is an issue. The need to align the start_date with the interval is counterintuitive, and it leads to a lot of questions and issue creation, even though it is in the documentation. If we are able to fix this with little or no consequence for current setups, that should be preferred, I think. The dependency explainer is really great work, but it doesn't address the core issue.
>>>>>>
>>>>>> I do recognize the issue mentioned by Max, and I agree with Jeremiah that it is a separate issue; however, if we are to tackle the above, it needs to be done in one go.
>>>>>>
>>>>>> If you consider a DAG a description of cohesion between work items (in OOP Java terms, a class), then a DagRun is the instantiation of a DAG in time (in OOP Java terms, an instance).
>>>>>> Tasks are then the description of a work item, and a TaskInstance the instantiation of a Task in time.
>>>>>>
>>>>>> In my opinion, issues pop up due to the current paradigm of considering the TaskInstance the smallest unit of work and asking it to maintain its own state in relation to other TaskInstances in a DagRun, and in a previous DagRun of which it has no (real) perception. Tasks are instantiated by a cartesian product with the dates of DagRuns instead of with the DagRuns themselves.
>>>>>>
>>>>>> The very loose coupling between DagRuns and TaskInstances can be improved while maintaining the flexibility to run tasks without a DagRun. This would help with a couple of things:
>>>>>>
>>>>>> 1. start_date can be used as an 'execution_date', or as a point in time from which to start looking
>>>>>> 2. a new interval for a dag will maintain depends_on_past
>>>>>> 3. paused dags do not give trouble
>>>>>> 4. tasks will be executed in order
>>>>>> 5. ignore_first_depends_on_past could be removed, as a task will now know if it is really the first
>>>>>>
>>>>>> In PR-1431 a lot of this work has been done by:
>>>>>>
>>>>>> 1. Adding a "previous" field to a DagRun, allowing it to connect to its predecessor
>>>>>> 2. Adding a dag_run_id to TaskInstances, so a TaskInstance knows about the DagRun if needed
>>>>>> 3. Using start_date + interval as the first run date, unless start_date is on the interval, in which case start_date is the first run date
>>>>>>
>>>>>> The downside of the PR would be that TaskInstances without a DagRun will not honor depends_on_past, so the backfill would need to be adjusted to create a DagRun and properly connect it to a previous DagRun if that exists. Not too big a deal, I think -- I will work on that a bit as well in the PR.
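The "previous" field described in point 1 can be sketched like this: a TaskInstance walks the chain of predecessor DagRuns, and either finds its prior state somewhere back in time or learns for certain that it is the first run. The names and shapes below are illustrative, not the PR's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DagRun:
    run_id: str
    state: str                            # e.g. "success", "failed", "running"
    previous: Optional["DagRun"] = None   # sketch of the PR's "previous" field
    task_states: dict = field(default_factory=dict)  # task_id -> state

def previous_task_state(dagrun: DagRun, task_id: str) -> Optional[str]:
    """Walk back through linked DagRuns to find the task's last state.
    Returning None means there is no predecessor: this really is the
    first run, so depends_on_past is satisfied without guessing."""
    run = dagrun.previous
    while run is not None:
        if task_id in run.task_states:
            return run.task_states[task_id]
        run = run.previous
    return None
```

Because the chain is explicit, the lookup works regardless of whether the schedule interval has changed between runs, which is exactly what Use Case 2 needs.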
>>>>>> So this should give at least an idea of where things could head. Any thoughts?
>>>>>>
>>>>>> Cheers,
>>>>>> Bolke
>>>>>>
>>>>>>> On 26 Apr 2016, at 18:42, Jeremiah Lowin <[email protected]> wrote:
>>>>>>>
>>>>>>> On further thought, I understand the issue you're describing, where this could lead to out-of-order runs. In fact, Bolke alerted me to the possibility earlier, but I didn't make the connection! That feels like a separate issue -- to guarantee that tasks are executed in order (and, more importantly, that their database entries are created). I think the depends_on_past issue is related but separate -- though it clearly needs to be fleshed out for all cases :)
>>>>>>>
>>>>>>> On Tue, Apr 26, 2016 at 12:28 PM Jeremiah Lowin <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'm afraid I disagree -- though we may be talking about two different issues. This issue deals specifically with how to identify the "past" TI when evaluating depends_on_past, and it shouldn't be impacted by shifting the start_date, transparently or not.
>>>>>>>>
>>>>>>>> Here are three valid examples of depends_on_past DAGs that would fail to run with the current setup:
>>>>>>>>
>>>>>>>> 1. A DAG with no schedule that is only run manually or via ad-hoc backfill. Without a schedule_interval, depends_on_past will always fail (since it looks back one schedule_interval).
>>>>>>>>
>>>>>>>> 2. A DAG with a schedule, but that is sometimes run off-schedule. Let's say a scheduled run succeeds and then an off-schedule run fails. When the next scheduled run starts, it shouldn't proceed, because the most recent task failed -- but it will look back one schedule_interval, jumping OVER the most recent run, and decide it's OK to proceed.
>>>>>>>>
>>>>>>>> 3. A DAG with a schedule that is paused for a while. This DAG could be running perfectly fine, but if it is paused for a while and then resumed, its depends_on_past tasks will look back one schedule_interval, see nothing, and therefore refuse to run.
>>>>>>>>
>>>>>>>> So my proposal is simply that the depends_on_past logic look back at the most recent task, as opposed to rigidly assuming there is a task one schedule_interval ago. For a regularly scheduled DAG, this will result in absolutely no behavior change. However, it will robustly support a much wider variety of cases, like the ones I listed above.
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Tue, Apr 26, 2016 at 11:08 AM Maxime Beauchemin <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> "The clear fix seems to be to have depends_on_past check the last TI that ran, regardless of whether it ran `schedule_interval` ago. That's in line with the intent of the flag. I will submit a fix."
>>>>>>>>>
>>>>>>>>> I don't think so. This would lead to skipping runs, whereas depends_on_past is used as a guarantee to run every TI, sequentially.
>>>>>>>>>
>>>>>>>>> Absolute scheduling (cron expressions) is much better than relative scheduling (origin + interval), though it's easy to make relative scheduling behave in an absolute way: you just have to use a start_date rounded to your schedule_interval and not move things around. Dynamic start_dates have always been a problem and should probably not be supported, though there's no way for us to tell.
>>>>>>>>>
>>>>>>>>> Changing the schedule interval or the "origin time" is a bit tricky and should be documented.
>>>>>>>>> If people use depends_on_past=True and change the origin or the interval, they basically redefine what "past" actually means, and it will require "marking success" or defining a new start_date as a way to say "please disregard depends_on_past for this date".
>>>>>>>>>
>>>>>>>>> For those who haven't fully digested "What's the deal with start_dates", please take the time to read it: http://pythonhosted.org/airflow/faq.html
>>>>>>>>>
>>>>>>>>> Also see this part of the docs:
>>>>>>>>>
>>>>>>>>> Max
>>>>>>>>>
>>>>>>>>> On Mon, Apr 25, 2016 at 1:14 PM, Jeremiah Lowin <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Bolke, Sid and I had a brief conversation to discuss some of the implications of https://github.com/airbnb/airflow/issues/1427
>>>>>>>>>>
>>>>>>>>>> There are two large points that need to be addressed:
>>>>>>>>>>
>>>>>>>>>> 1. This particular issue arises because of an alignment issue between start_date and schedule_interval. It can only happen with cron-based schedule_intervals that describe absolute points in time (like "1am"), as opposed to time deltas (like "every hour"). Ironically, I once reported this same issue myself (#959). In the past (and in the docs) we have simply said that users must make sure the two params agree. We discussed the possibility of a DAG validation method to raise an error if the start_date and schedule_interval don't align, but Bolke made the point (and I agreed) that in these cases start_date is sort of like telling the scheduler to "start paying attention", as opposed to "this is my first execution date".
>>>>>>>>>> In #1427, the scheduler was being asked to start paying attention on 4/24/16 00:00:00 but not to do anything until 4/24/16 01:10:00. However, it was scheduling a first run at midnight and a second run at 1:10.
>>>>>>>>>>
>>>>>>>>>> Regardless of whether we choose to validate/warn/error, Bolke is going to change the scheduling logic so that the cron-based interval takes precedence over the start date. Specifically, the first date on or after the start_date that complies with the schedule_interval becomes the first execution date.
>>>>>>>>>>
>>>>>>>>>> 2. Issue #1 led to a second issue: depends_on_past checks for a successful TI at `execution_date - schedule_interval`. This is fragile, since it is very possible for the previous TI to have run at any time in the past, not just one schedule_interval ago. This can happen easily with ad-hoc DAG runs, and also if a DAG was paused for a while. Less commonly, it happens in the situation described in point #1, where the first scheduled run is off-schedule (the midnight run followed by the daily 1:10am runs).
>>>>>>>>>>
>>>>>>>>>> The clear fix seems to be to have depends_on_past check the last TI that ran, regardless of whether it ran `schedule_interval` ago. That's in line with the intent of the flag. I will submit a fix.
>>>>>>>>>>
>>>>>>>>>> -J
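Jeremiah's proposed fix (look back at the most recent TI rather than exactly one schedule_interval ago) can be sketched as below. This is a simplified, hypothetical illustration, not the actual Airflow query.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Hypothetical minimal task-instance record for illustration.
TI = namedtuple("TI", "task_id execution_date state")

def previous_ti(task_instances, task_id, execution_date):
    """Return the most recent TI for task_id before execution_date,
    whenever it ran -- instead of requiring one at exactly
    execution_date - schedule_interval."""
    earlier = [ti for ti in task_instances
               if ti.task_id == task_id and ti.execution_date < execution_date]
    return max(earlier, key=lambda ti: ti.execution_date) if earlier else None
```

In example 2 from the thread (a scheduled success followed by an off-schedule failure), this lookup finds the failed off-schedule run, whereas the interval-based lookup would jump over it and wrongly allow the next scheduled run to proceed.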
