I created https://issues.apache.org/jira/browse/AIRFLOW-921 to track the pending issues.
There are two more issues we found which I included there: 1. Task instances that have their state manually set to running make the UI for their DAG unable to parse 2. Mark success doesn't work for non existent task instances/dagruns which breaks the subdag use case (setting tasks as successful via the graph view) On Mon, Feb 27, 2017 at 11:06 AM, Bolke de Bruin <[email protected]> wrote: > Hey Max > > It is massive for sure. Sorry about that ;-). However it is not as massive > as you might deduct from a first view. 0) run tasks concurrently across dag > runs 1) ordering of the tasks was added to the loop. 2) calculating of > deadlocks, running tasks, tasks to run was corrected, 3) relying on the > executor for status updates was replaced, 4) (tbd) executor failure check > to protect against endless Ioops. > > 0+1 seem bigger than they are due to the amount of lines changed. 2 is a > subtle change, that touches a couple of lines to pop/push properly. 3) is > bigger, as I didn't like the reliance on the executor. 4) is old code that > needs to be added again. > > I probably can leave out 3 which makes 4 mood. The change would be > smaller. Maybe I could even completely remove 3 and just add 4. What are > your thoughts? > > The random failures we were seeing were the "implicit" test of not a > executing in the right order and then deadlocking. But no explicit tests > exist. Help would definitely be appreciated. > > Yes I thought about using the scheduler and/or reusing logic from the > scheduler. I even experimented a little with it but it didn't allow me to > pass the tests effectively. > > What I am planning to do is split the function and make it unit testable > if you agree with the current approach. > > Bolke > > Sent from my iPhone > > > On 27 Feb 2017, at 18:35, Maxime Beauchemin <[email protected]> > wrote: > > > > This PR is pretty massive and complex! It looks like solid work but let's > > be really careful around testing and rolling this out. > > > > This may be out of scope for this PR, but wanted to discuss the idea of > > using the scheduler's logic to perform backfills. It'd be nice to have > that > > logic in one place, though I lost grasp on the details around feasibility > > around this approach. I'm sure you looked into this option before issuing > > this PR and I'm curious to hear your thoughts on blockers/challenges > around > > this alternate approach. > > > > Also I'm wondering whether we have any sort of mechanisms in our > > integration test to validate that task dependencies are respected and run > > in the right order. If not I was thinking we could build some abstraction > > to make it easy to write this type of tests in an expressive way. > > > > ``` > > #[some code to run a backfill, or a scheduler session] > > it = IntegrationTestResults(dag_id='exmaple1') > > assert it.ran_before('task1', 'task_2') > > assert ti.overlapped('task1', 'task_3') # confirms 2 tasks ran in > parallel > > assert ti.none_failed() > > assert ti.ran_last('root') > > assert ti.max_concurrency_reached() == POOL_LIMIT > > ``` > > > > Max > > > >> On Mon, Feb 27, 2017 at 5:41 AM, Bolke de Bruin <[email protected]> > wrote: > >> > >> I have worked in the Backfill issue also in collaboration with Jeremiah. > >> > >> The refactor to use dag runs in backfills caused a regression > >> in task execution performance as dag runs were executed > >> sequentially. Next to that, the backfills were non deterministic > >> due to the random execution of tasks, causing root tasks > >> being added to the non ready list too soon. > >> > >> This updates the backfill logic as follows: > >> > >> • Parallelize execution of tasks > >> • Use a leave first execution model; Breadth-first algorithm by > >> Jerermiah > >> • Replace state updates from the executor by task based only > >> updates > >> > >> https://github.com/apache/incubator-airflow/pull/2107 > >> > >> Please review and test properly. > >> > >> What has been left out at the moment is the checking the executor itself > >> for multiple failures of a task run, where the task itself was never > able > >> to execute. Let me know if this is a real world scenario (maybe when > disk > >> space issue?). I will add it back in. > >> > >> - Bolke > >> > >> > >>> On 25 Feb 2017, at 09:07, Bolke de Bruin <[email protected]> wrote: > >>> > >>> Hi Dan, > >>> > >>> - Backfill indeed runs only one dagrun at the time, see line 1755 of > >> jobs.py. I’ll think about how to fix this over the weekend (I think it > was > >> my change that introduced this). Suggestions always welcome. Depending > the > >> impact it is a blocker or not. We don’t often use backfills and > definitely > >> not at your size, so that is why it didn’t pop up with us. I’m assuming > >> blocker for now, btw. > >>> - Speculation on the High DB Load. I’m not sure what your benchmark is > >> here (1.7.1 + multi processor dags?), but as you mentioned in the code > >> dependencies are checked a couple of times for one run and even task > >> instance. Dependency checking requires aggregation on the DB, which is a > >> performance killer. Annoying but not a blocker. > >>> - Skipped tasks potentially cause a dagrun to be marked failure/success > >> prematurely. BranchOperators are widely used if it affects these > operators, > >> then it is a blocker. > >>> > >>> - Bolke > >>> > >>>> On 25 Feb 2017, at 02:04, Dan Davydov <[email protected]. > INVALID> > >> wrote: > >>>> > >>>> Update on old pending issues: > >>>> - Black Squares in UI: Fix merged > >>>> - Double Trigger Issue That Alex G Mentioned: Alex has a PR in flight > >>>> > >>>> New Issues: > >>>> - Backfill seems to be having issues (only running one dagrun at a > >> time), > >>>> we are still investigating - might be a blocker > >>>> - High DB Load (~8x more than 1.7) - We are still investigating but > it's > >>>> probably not a blocker for the release > >>>> - Skipped tasks potentially cause a dagrun to be marked as > >> failure/success > >>>> prematurely - not sure whether or not to classify this as a blocker > >> (only > >>>> really an issue for users who use the BranchingPythonOperator, which > >> AirBnB > >>>> does) > >>>> > >>>> On Thu, Feb 23, 2017 at 5:59 PM, siddharth anand <[email protected]> > >> wrote: > >>>> > >>>>> IMHO, a DAG run without a start date is non-sensical but is not > >> enforced > >>>>> That said, our UI allows for the manual creation of DAG Runs without > a > >>>>> start date as shown in the images below: > >>>>> > >>>>> > >>>>> - https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot% > >>>>> 202017-02-22%2016.00.40.png?dl=0 > >>>>> <https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot% > >>>>> 202017-02-22%2016.00.40.png?dl=0> > >>>>> - https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot% > >>>>> 202017-02-22%2016.02.22.png?dl=0 > >>>>> <https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot% > >>>>> 202017-02-22%2016.02.22.png?dl=0> > >>>>> > >>>>> > >>>>> On Wed, Feb 22, 2017 at 2:26 PM, Maxime Beauchemin < > >>>>> [email protected]> wrote: > >>>>> > >>>>>> Our database may have edge cases that could be associated with > running > >>>>> any > >>>>>> previous version that may or may not have been part of an official > >>>>> release. > >>>>>> > >>>>>> Let's see if anyone else reports the issue. If no one does, one > >> option is > >>>>>> to release 1.8.0 as is with a comment in the release notes, and > have a > >>>>>> future official minor apache release 1.8.1 that would fix these > minor > >>>>>> issues that are not deal breaker. > >>>>>> > >>>>>> @bolke, I'm curious, how long does it take you to go through one > >> release > >>>>>> cycle? Oh, and do you have a documented step by step process for > >>>>> releasing? > >>>>>> I'd like to add the Pypi part to this doc and add committers that > are > >>>>>> interested to have rights on the project on Pypi. > >>>>>> > >>>>>> Max > >>>>>> > >>>>>> On Wed, Feb 22, 2017 at 2:00 PM, Bolke de Bruin <[email protected]> > >>>>> wrote: > >>>>>> > >>>>>>> So it is a database integrity issue? Afaik a start_date should > always > >>>>> be > >>>>>>> set for a DagRun (create_dagrun) does so I didn't check the code > >>>>> though. > >>>>>>> > >>>>>>> Sent from my iPhone > >>>>>>> > >>>>>>>> On 22 Feb 2017, at 22:19, Dan Davydov <[email protected]. > >>>>> INVALID> > >>>>>>> wrote: > >>>>>>>> > >>>>>>>> Should clarify this occurs when a dagrun does not have a start > date, > >>>>>> not > >>>>>>> a > >>>>>>>> dag (which makes it even less likely to happen). I don't think > this > >>>>> is > >>>>>> a > >>>>>>>> blocker for releasing. > >>>>>>>> > >>>>>>>>> On Wed, Feb 22, 2017 at 1:15 PM, Dan Davydov < > >>>>> [email protected]> > >>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> I rolled this out in our prod and the webservers failed to load > due > >>>>> to > >>>>>>>>> this commit: > >>>>>>>>> > >>>>>>>>> [AIRFLOW-510] Filter Paused Dags, show Last Run & Trigger Dag > >>>>>>>>> 7c94d81c390881643f94d5e3d7d6fb351a445b72 > >>>>>>>>> > >>>>>>>>> This fixed it: > >>>>>>>>> - </a> <span id="statuses_info" > >>>>>>>>> class="glyphicon glyphicon-info-sign" aria-hidden="true" > >>>>> title="Start > >>>>>>> Date: > >>>>>>>>> {{last_run.start_date.strftime('%Y-%m-%d %H:%M')}}"></span> > >>>>>>>>> + </a> <span id="statuses_info" > >>>>>>>>> class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> > >>>>>>>>> > >>>>>>>>> This is caused by assuming that all DAGs have start dates set, > so a > >>>>>>> broken > >>>>>>>>> DAG will take down the whole UI. Not sure if we want to make > this a > >>>>>>> blocker > >>>>>>>>> for the release or not, I'm guessing for most deployments this > >> would > >>>>>>> occur > >>>>>>>>> pretty rarely. I'll submit a PR to fix it soon. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Tue, Feb 21, 2017 at 9:49 AM, Chris Riccomini < > >>>>>> [email protected] > >>>>>>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Ack that the vote has already passed, but belated +1 (binding) > >>>>>>>>>> > >>>>>>>>>> On Tue, Feb 21, 2017 at 7:42 AM, Bolke de Bruin < > >> [email protected] > >>>>>> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> IPMC Voting can be found here: > >>>>>>>>>>> > >>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/ > >>>>>>>>>> 201702.mbox/% > >>>>>>>>>>> [email protected]%3e < > >>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/ > >>>>>>>>>> 201702.mbox/% > >>>>>>>>>>> [email protected]%3E> > >>>>>>>>>>> > >>>>>>>>>>> Kind regards, > >>>>>>>>>>> Bolke > >>>>>>>>>>> > >>>>>>>>>>>> On 21 Feb 2017, at 08:20, Bolke de Bruin <[email protected]> > >>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> Hello, > >>>>>>>>>>>> > >>>>>>>>>>>> Apache Airflow (incubating) 1.8.0 (based on RC4) has been > >>>>> accepted. > >>>>>>>>>>>> > >>>>>>>>>>>> 9 “+1” votes received: > >>>>>>>>>>>> > >>>>>>>>>>>> - Maxime Beauchemin (binding) > >>>>>>>>>>>> - Arthur Wiedmer (binding) > >>>>>>>>>>>> - Dan Davydov (binding) > >>>>>>>>>>>> - Jeremiah Lowin (binding) > >>>>>>>>>>>> - Siddharth Anand (binding) > >>>>>>>>>>>> - Alex van Boxel (binding) > >>>>>>>>>>>> - Bolke de Bruin (binding) > >>>>>>>>>>>> > >>>>>>>>>>>> - Jayesh Senjaliya (non-binding) > >>>>>>>>>>>> - Yi (non-binding) > >>>>>>>>>>>> > >>>>>>>>>>>> Vote thread (start): > >>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator- > >>>>>>>>>>> airflow-dev/201702.mbox/%3cD360D9BE-C358-42A1-9188- > >>>>>>>>>>> [email protected]%3e <http://mail-archives.apache. > >>>>>>>>>>> org/mod_mbox/incubator-airflow-dev/201702.mbox/%3C7EB7B6D6- > >>>>>>>>>> 092E-48D2-AA0F- > >>>>>>>>>>> [email protected]%3E> > >>>>>>>>>>>> > >>>>>>>>>>>> Next steps: > >>>>>>>>>>>> 1) will start the voting process at the IPMC mailinglist. I do > >>>>>> expect > >>>>>>>>>>> some changes to be required mostly in documentation maybe a > >>>>> license > >>>>>>> here > >>>>>>>>>>> and there. So, we might end up with changes to stable. As long > as > >>>>>>> these > >>>>>>>>>> are > >>>>>>>>>>> not (significant) code changes I will not re-raise the vote. > >>>>>>>>>>>> 2) Only after the positive voting on the IPMC and > finalisation I > >>>>>> will > >>>>>>>>>>> rebrand the RC to Release. > >>>>>>>>>>>> 3) I will upload it to the incubator release page, then the > tar > >>>>>> ball > >>>>>>>>>>> needs to propagate to the mirrors. > >>>>>>>>>>>> 4) Update the website (can someone volunteer please?) > >>>>>>>>>>>> 5) Finally, I will ask Maxime to upload it to pypi. It seems > we > >>>>> can > >>>>>>>>>> keep > >>>>>>>>>>> the apache branding as lib cloud is doing this as well ( > >>>>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package < > >>>>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package>). > >>>>>>>>>>>> > >>>>>>>>>>>> Jippie! > >>>>>>>>>>>> > >>>>>>>>>>>> Bolke > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>> > >> > >> >
