I created https://issues.apache.org/jira/browse/AIRFLOW-921 to track the
pending issues.

There are two more issues we found which I included there:
1. Task instances that have their state manually set to running make the UI
for their DAG unable to parse
2. Mark success doesn't work for non existent task instances/dagruns which
breaks the subdag use case (setting tasks as successful via the graph view)

On Mon, Feb 27, 2017 at 11:06 AM, Bolke de Bruin <[email protected]> wrote:

> Hey Max
>
> It is massive for sure. Sorry about that ;-). However it is not as massive
> as you might deduct from a first view. 0) run tasks concurrently across dag
> runs 1) ordering of the tasks was added to the loop. 2) calculating of
> deadlocks, running tasks, tasks to run was corrected, 3) relying on the
> executor for status updates was replaced, 4) (tbd) executor failure check
> to protect against endless Ioops.
>
> 0+1 seem bigger than they are due to the amount of lines changed. 2 is a
> subtle change, that touches a couple of lines to pop/push properly. 3) is
> bigger, as I didn't like the reliance on the executor. 4) is old code that
> needs to be added again.
>
> I probably can leave out 3 which makes 4 mood. The change would be
> smaller. Maybe I could even completely remove 3 and just add 4. What are
> your thoughts?
>
> The random failures we were seeing were the "implicit" test of not a
> executing in the right order and then deadlocking. But no explicit tests
> exist. Help would definitely be appreciated.
>
> Yes I thought about using the scheduler and/or reusing logic from the
> scheduler. I even experimented a little with it but it didn't allow me to
> pass the tests effectively.
>
> What I am planning to do is split the function and make it unit testable
> if you agree with the current approach.
>
> Bolke
>
> Sent from my iPhone
>
> > On 27 Feb 2017, at 18:35, Maxime Beauchemin <[email protected]>
> wrote:
> >
> > This PR is pretty massive and complex! It looks like solid work but let's
> > be really careful around testing and rolling this out.
> >
> > This may be out of scope for this PR, but wanted to discuss the idea of
> > using the scheduler's logic to perform backfills. It'd be nice to have
> that
> > logic in one place, though I lost grasp on the details around feasibility
> > around this approach. I'm sure you looked into this option before issuing
> > this PR and I'm curious to hear your thoughts on blockers/challenges
> around
> > this alternate approach.
> >
> > Also I'm wondering whether we have any sort of mechanisms in our
> > integration test to validate that task dependencies are respected and run
> > in the right order. If not I was thinking we could build some abstraction
> > to make it easy to write this type of tests in an expressive way.
> >
> > ```
> > #[some code to run a backfill, or a scheduler session]
> > it = IntegrationTestResults(dag_id='exmaple1')
> > assert it.ran_before('task1', 'task_2')
> > assert ti.overlapped('task1', 'task_3') # confirms 2 tasks ran in
> parallel
> > assert ti.none_failed()
> > assert ti.ran_last('root')
> > assert ti.max_concurrency_reached() == POOL_LIMIT
> > ```
> >
> > Max
> >
> >> On Mon, Feb 27, 2017 at 5:41 AM, Bolke de Bruin <[email protected]>
> wrote:
> >>
> >> I have worked in the Backfill issue also in collaboration with Jeremiah.
> >>
> >> The refactor to use dag runs in backfills caused a regression
> >> in task execution performance as dag runs were executed
> >> sequentially. Next to that, the backfills were non deterministic
> >> due to the random execution of tasks, causing root tasks
> >> being added to the non ready list too soon.
> >>
> >> This updates the backfill logic as follows:
> >>
> >>        • Parallelize execution of tasks
> >>        • Use a leave first execution model; Breadth-first algorithm by
> >> Jerermiah
> >>        • Replace state updates from the executor by task based only
> >> updates
> >>
> >> https://github.com/apache/incubator-airflow/pull/2107
> >>
> >> Please review and test properly.
> >>
> >> What has been left out at the moment is the checking the executor itself
> >> for multiple failures of a task run, where the task itself was never
> able
> >> to execute. Let me know if this is a real world scenario (maybe when
> disk
> >> space issue?). I will add it back in.
> >>
> >> - Bolke
> >>
> >>
> >>> On 25 Feb 2017, at 09:07, Bolke de Bruin <[email protected]> wrote:
> >>>
> >>> Hi Dan,
> >>>
> >>> - Backfill indeed runs only one dagrun at the time, see line 1755 of
> >> jobs.py. I’ll think about how to fix this over the weekend (I think it
> was
> >> my change that introduced this). Suggestions always welcome. Depending
> the
> >> impact it is a blocker or not. We don’t often use backfills and
> definitely
> >> not at your size, so that is why it didn’t pop up with us. I’m assuming
> >> blocker for now, btw.
> >>> - Speculation on the High DB Load. I’m not sure what your benchmark is
> >> here (1.7.1 + multi processor dags?), but as you mentioned in the code
> >> dependencies are checked a couple of times for one run and even task
> >> instance. Dependency checking requires aggregation on the DB, which is a
> >> performance killer. Annoying but not a blocker.
> >>> - Skipped tasks potentially cause a dagrun to be marked failure/success
> >> prematurely. BranchOperators are widely used if it affects these
> operators,
> >> then it is a blocker.
> >>>
> >>> - Bolke
> >>>
> >>>> On 25 Feb 2017, at 02:04, Dan Davydov <[email protected].
> INVALID>
> >> wrote:
> >>>>
> >>>> Update on old pending issues:
> >>>> - Black Squares in UI: Fix merged
> >>>> - Double Trigger Issue That Alex G Mentioned: Alex has a PR in flight
> >>>>
> >>>> New Issues:
> >>>> - Backfill seems to be having issues (only running one dagrun at a
> >> time),
> >>>> we are still investigating - might be a blocker
> >>>> - High DB Load (~8x more than 1.7) - We are still investigating but
> it's
> >>>> probably not a blocker for the release
> >>>> - Skipped tasks potentially cause a dagrun to be marked as
> >> failure/success
> >>>> prematurely - not sure whether or not to classify this as a blocker
> >> (only
> >>>> really an issue for users who use the BranchingPythonOperator, which
> >> AirBnB
> >>>> does)
> >>>>
> >>>> On Thu, Feb 23, 2017 at 5:59 PM, siddharth anand <[email protected]>
> >> wrote:
> >>>>
> >>>>> IMHO, a DAG run without a start date is non-sensical but is not
> >> enforced
> >>>>> That said, our UI allows for the manual creation of DAG Runs without
> a
> >>>>> start date as shown in the images below:
> >>>>>
> >>>>>
> >>>>> - https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot%
> >>>>> 202017-02-22%2016.00.40.png?dl=0
> >>>>> <https://www.dropbox.com/s/3sxcqh04eztpl7p/Screenshot%
> >>>>> 202017-02-22%2016.00.40.png?dl=0>
> >>>>> - https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot%
> >>>>> 202017-02-22%2016.02.22.png?dl=0
> >>>>> <https://www.dropbox.com/s/4q6rr9dwghag1yy/Screenshot%
> >>>>> 202017-02-22%2016.02.22.png?dl=0>
> >>>>>
> >>>>>
> >>>>> On Wed, Feb 22, 2017 at 2:26 PM, Maxime Beauchemin <
> >>>>> [email protected]> wrote:
> >>>>>
> >>>>>> Our database may have edge cases that could be associated with
> running
> >>>>> any
> >>>>>> previous version that may or may not have been part of an official
> >>>>> release.
> >>>>>>
> >>>>>> Let's see if anyone else reports the issue. If no one does, one
> >> option is
> >>>>>> to release 1.8.0 as is with a comment in the release notes, and
> have a
> >>>>>> future official minor apache release 1.8.1 that would fix these
> minor
> >>>>>> issues that are not deal breaker.
> >>>>>>
> >>>>>> @bolke, I'm curious, how long does it take you to go through one
> >> release
> >>>>>> cycle? Oh, and do you have a documented step by step process for
> >>>>> releasing?
> >>>>>> I'd like to add the Pypi part to this doc and add committers that
> are
> >>>>>> interested to have rights on the project on Pypi.
> >>>>>>
> >>>>>> Max
> >>>>>>
> >>>>>> On Wed, Feb 22, 2017 at 2:00 PM, Bolke de Bruin <[email protected]>
> >>>>> wrote:
> >>>>>>
> >>>>>>> So it is a database integrity issue? Afaik a start_date should
> always
> >>>>> be
> >>>>>>> set for a DagRun (create_dagrun) does so  I didn't check the code
> >>>>> though.
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>>> On 22 Feb 2017, at 22:19, Dan Davydov <[email protected].
> >>>>> INVALID>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Should clarify this occurs when a dagrun does not have a start
> date,
> >>>>>> not
> >>>>>>> a
> >>>>>>>> dag (which makes it even less likely to happen). I don't think
> this
> >>>>> is
> >>>>>> a
> >>>>>>>> blocker for releasing.
> >>>>>>>>
> >>>>>>>>> On Wed, Feb 22, 2017 at 1:15 PM, Dan Davydov <
> >>>>> [email protected]>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> I rolled this out in our prod and the webservers failed to load
> due
> >>>>> to
> >>>>>>>>> this commit:
> >>>>>>>>>
> >>>>>>>>> [AIRFLOW-510] Filter Paused Dags, show Last Run & Trigger Dag
> >>>>>>>>> 7c94d81c390881643f94d5e3d7d6fb351a445b72
> >>>>>>>>>
> >>>>>>>>> This fixed it:
> >>>>>>>>> -                            </a> <span id="statuses_info"
> >>>>>>>>> class="glyphicon glyphicon-info-sign" aria-hidden="true"
> >>>>> title="Start
> >>>>>>> Date:
> >>>>>>>>> {{last_run.start_date.strftime('%Y-%m-%d %H:%M')}}"></span>
> >>>>>>>>> +                            </a> <span id="statuses_info"
> >>>>>>>>> class="glyphicon glyphicon-info-sign" aria-hidden="true"></span>
> >>>>>>>>>
> >>>>>>>>> This is caused by assuming that all DAGs have start dates set,
> so a
> >>>>>>> broken
> >>>>>>>>> DAG will take down the whole UI. Not sure if we want to make
> this a
> >>>>>>> blocker
> >>>>>>>>> for the release or not, I'm guessing for most deployments this
> >> would
> >>>>>>> occur
> >>>>>>>>> pretty rarely. I'll submit a PR to fix it soon.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 21, 2017 at 9:49 AM, Chris Riccomini <
> >>>>>> [email protected]
> >>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Ack that the vote has already passed, but belated +1 (binding)
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 21, 2017 at 7:42 AM, Bolke de Bruin <
> >> [email protected]
> >>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> IPMC Voting can be found here:
> >>>>>>>>>>>
> >>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/
> >>>>>>>>>> 201702.mbox/%
> >>>>>>>>>>> [email protected]%3e <
> >>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-general/
> >>>>>>>>>> 201702.mbox/%
> >>>>>>>>>>> [email protected]%3E>
> >>>>>>>>>>>
> >>>>>>>>>>> Kind regards,
> >>>>>>>>>>> Bolke
> >>>>>>>>>>>
> >>>>>>>>>>>> On 21 Feb 2017, at 08:20, Bolke de Bruin <[email protected]>
> >>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hello,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Apache Airflow (incubating) 1.8.0 (based on RC4) has been
> >>>>> accepted.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 9 “+1” votes received:
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Maxime Beauchemin (binding)
> >>>>>>>>>>>> - Arthur Wiedmer (binding)
> >>>>>>>>>>>> - Dan Davydov (binding)
> >>>>>>>>>>>> - Jeremiah Lowin (binding)
> >>>>>>>>>>>> - Siddharth Anand (binding)
> >>>>>>>>>>>> - Alex van Boxel (binding)
> >>>>>>>>>>>> - Bolke de Bruin (binding)
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Jayesh Senjaliya (non-binding)
> >>>>>>>>>>>> - Yi (non-binding)
> >>>>>>>>>>>>
> >>>>>>>>>>>> Vote thread (start):
> >>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/incubator-
> >>>>>>>>>>> airflow-dev/201702.mbox/%3cD360D9BE-C358-42A1-9188-
> >>>>>>>>>>> [email protected]%3e <http://mail-archives.apache.
> >>>>>>>>>>> org/mod_mbox/incubator-airflow-dev/201702.mbox/%3C7EB7B6D6-
> >>>>>>>>>> 092E-48D2-AA0F-
> >>>>>>>>>>> [email protected]%3E>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Next steps:
> >>>>>>>>>>>> 1) will start the voting process at the IPMC mailinglist. I do
> >>>>>> expect
> >>>>>>>>>>> some changes to be required mostly in documentation maybe a
> >>>>> license
> >>>>>>> here
> >>>>>>>>>>> and there. So, we might end up with changes to stable. As long
> as
> >>>>>>> these
> >>>>>>>>>> are
> >>>>>>>>>>> not (significant) code changes I will not re-raise the vote.
> >>>>>>>>>>>> 2) Only after the positive voting on the IPMC and
> finalisation I
> >>>>>> will
> >>>>>>>>>>> rebrand the RC to Release.
> >>>>>>>>>>>> 3) I will upload it to the incubator release page, then the
> tar
> >>>>>> ball
> >>>>>>>>>>> needs to propagate to the mirrors.
> >>>>>>>>>>>> 4) Update the website (can someone volunteer please?)
> >>>>>>>>>>>> 5) Finally, I will ask Maxime to upload it to pypi. It seems
> we
> >>>>> can
> >>>>>>>>>> keep
> >>>>>>>>>>> the apache branding as lib cloud is doing this as well (
> >>>>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package <
> >>>>>>>>>>> https://libcloud.apache.org/downloads.html#pypi-package>).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Jippie!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Bolke
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>
> >>
>

Reply via email to