+1. Very interesting to read about these bottlenecks, and very grateful they are being addressed.
Sent from my really tiny device...

> On Jan 11, 2014, at 8:44 AM, "Sean Dague" <s...@dague.net> wrote:
>
> First, thanks a ton for diving in on all this Russell. The big push by the
> Nova team recently is really helpful.
>
>> On 01/11/2014 09:57 AM, Russell Bryant wrote:
>>> On 01/09/2014 04:16 PM, Russell Bryant wrote:
>>>> On 01/08/2014 05:53 PM, Joe Gordon wrote:
>>>> Hi All,
>>>>
>>>> As you know the gate has been in particularly bad shape (gate queue over
>>>> 100!) this week due to a number of factors. One factor is how many major
>>>> outstanding bugs we have in the gate. Below is a list of the top 4 open
>>>> gate bugs.
>>>>
>>>> Here are some fun facts about this list:
>>>> * All bugs have been open for over a month
>>>> * All are nova bugs
>>>> * These 4 bugs alone were hit 588 times which averages to 42 hits per
>>>> day (data is over two weeks)!
>>>>
>>>> If we want the gate queue to drop and not have to continuously run
>>>> 'recheck bug x' we need to fix these bugs. So I'm looking for
>>>> volunteers to help debug and fix these bugs.
>>>
>>> I created the following etherpad to help track the most important Nova
>>> gate bugs, who is actively working on them, and any patches that we have
>>> in flight to help address them:
>>>
>>> https://etherpad.openstack.org/p/nova-gate-issue-tracking
>>>
>>> Please jump in if you can. We shouldn't wait for the gate bug day to
>>> move on these. Even if others are already looking at a bug, feel free
>>> to do the same. We need multiple sets of eyes on each of these issues.
>>
>> Some good progress from the last few days:
>>
>> After looking at a lot of failures, we determined that the vast majority
>> of failures are performance related. The load being put on the
>> OpenStack deployment is just too high. We're working to address this to
>> make the gate more reliable in a number of ways.
>>
>> 1) (merged) https://review.openstack.org/#/c/65760/
>>
>> The large-ops test was cut back from spawning 100 instances to 50. From
>> the commit message:
>>
>> It turns out the variance in cloud instances is very high, especially
>> when comparing different cloud providers and regions. This test was
>> originally added as a regression test for the nova-network issues with
>> rootwrap. At which time this test wouldn't pass for 30 instances. So
>> 50 is still a valid regression test.
>>
>> 2) (merged) https://review.openstack.org/#/c/45766/
>>
>> nova-compute is able to do work in parallel very well. nova-conductor
>> cannot by default due to the details of our use of eventlet + how we
>> talk to MySQL. The way you allow nova-conductor to do its work in
>> parallel is by running multiple conductor workers. We had not enabled
>> this by default in devstack, so our 4 vCPU test nodes were only using a
>> single conductor worker. They now use 4 conductor workers.
>>
>> 3) (still testing) https://review.openstack.org/#/c/65805/
>>
>> Right now when tempest runs in the devstack-gate jobs, it runs with
>> concurrency=4 (run 4 tests at once). Unfortunately, it appears that
>> this maxes out the deployment and results in timeouts (usually network
>> related).
>>
>> This patch changes tempest concurrency to 2 instead of 4. The initial
>> results are quite promising. The tests have been passing reliably so
>> far, but we're going to continue to recheck this for a while longer for
>> more data.
>>
>> One very interesting observation on this came from Jim where he said "A
>> quick glance suggests 1.2x -- 1.4x change in runtime." If the
>> deployment were *not* being maxed out, we would expect this change to
>> result in much closer to a 2x runtime increase.
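
A back-of-the-envelope model makes that reasoning concrete. The sketch
below is a toy calculation with invented per-test numbers (nothing in it
is measured from tempest or the gate): if each test splits its time
between client-side work that parallelizes cleanly and waiting on a
backend that is already saturated, halving the tempest concurrency moves
the wall-clock time far less than 2x.

# Toy model only -- the 1.0s/0.75s split is invented for illustration,
# not measured from the gate.
def wall_time(n_tests, concurrency, client_secs, backend_secs):
    # Client-side work scales down with concurrency; a saturated backend
    # effectively serializes its share no matter how many tests run.
    return n_tests * client_secs / concurrency + n_tests * backend_secs

# If nothing were contended, dropping concurrency from 4 to 2 would
# roughly double the runtime:
print(wall_time(100, 2, 1.0, 0.0) / wall_time(100, 4, 1.0, 0.0))   # 2.0

# With a saturated backend eating a fixed share per test, the ratio
# shrinks into the 1.2x -- 1.4x range Jim observed:
print(wall_time(100, 2, 1.0, 0.75) / wall_time(100, 4, 1.0, 0.75)) # 1.25

In other words, a small slowdown from dropping concurrency is itself a
hint that the deployment, not the test runner, is the bottleneck.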
>
> We could also address this by locally turning up timeouts on operations that
> are timing out, which would let those things take the time they need.
>
> Before dropping the concurrency I'd really like to make sure we can point to
> specific fails that we think will go away. There was a lot of speculation
> around nova-network, however the nova-network timeout errors only pop up in
> elastic search on large-ops jobs, not normal tempest jobs. Definitely making
> OpenStack more idle will make more tests pass. The Neutron team has
> experienced that.
>
> It would be a ton better if we could actually feed back a 503 with a retry
> time (which I realize is a ton of work).
>
> Because if we decide we're now always pinned to only 2way, we have to start
> doing some major rethinking on our test strategy, as we'll be way outside the
> soft 45min time budget we've been trying to operate on. We'd actually been
> planning on going up to 8way, but were waiting for some issues to get fixed
> before we did that. It would sort of immediately put a moratorium on new
> tests. If that's what we need to do, that's what we need to do, but we should
> talk it through.
>
>> 4) (approved, not yet merged) https://review.openstack.org/#/c/65784/
>>
>> nova-network seems to be the largest bottleneck in terms of performance
>> problems when nova is maxed out on these test nodes. This patch is one
>> quick speedup we can make by not using rootwrap in a few cases where it
>> wasn't necessary. These really add up.
>>
>> 5) https://review.openstack.org/#/c/65989/
>>
>> This patch isn't a candidate for merging, but was written to test the
>> theory that by updating nova-network to use conductor instead of direct
>> database access, nova-network will be able to do work in parallel better
>> than it does today, just as we have observed with nova-compute.
>>
>> Dan's initial test results from this are **very** promising. Initial
>> testing showed a 20% speedup in runtime and a 33% decrease in CPU
>> consumption by nova-network.
>>
>> Doing this properly will not be quick, but I'm hopeful that we can
>> complete it by the Icehouse release. We will need to convert
>> nova-network to use Nova's object model. Much of this work is starting
>> to catch nova-network up on work that we've been doing in the rest of
>> the tree but have passed on doing for nova-network due to nova-network
>> being in a freeze.
>
> I'm a huge +1 on fixing this in nova-network.
>
>> 6) (no patch yet)
>>
>> We haven't had time to dive too deep into this yet, but we would also
>> like to revisit our locking usage and how it is affecting nova-network
>> performance. There may be some more significant improvements we can
>> make there.
>>
>>
>> Final notes:
>>
>> I am hopeful that by addressing these performance issues both in Nova's
>> code, as well as by turning down the test load, that we will see a
>> significant increase in gate reliability in the near future. I
>> apologize on behalf of the Nova team for Nova's contribution to gate
>> instability.
>>
>> *Thank you* to everyone who has been helping out!
>
> Yes, thanks much to everyone here.
>
> -Sean
>
> --
> Sean Dague
> Samsung Research America
> s...@dague.net / sean.da...@samsung.com
> http://dague.net
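
On Sean's point about feeding back a 503 with a retry time: the response
mechanics are the small part of that work, and a rough sketch could look
like the hypothetical WSGI middleware below. This is not an existing Nova
or Tempest feature; it assumes webob, which the OpenStack API services
already build on, and the hard part Sean alludes to (knowing reliably when
the service is overloaded and what retry interval to advertise) is glossed
over by the placeholder _overloaded() hook.

import webob
import webob.dec


class RetryAfterMiddleware(object):
    """Hypothetical sketch: answer 503 + Retry-After while overloaded."""

    def __init__(self, application, retry_after=5):
        self.application = application
        self.retry_after = retry_after

    def _overloaded(self):
        # Placeholder: a real implementation needs an actual load signal
        # (RPC queue depth, DB pool saturation, conductor backlog, ...).
        return False

    @webob.dec.wsgify
    def __call__(self, req):
        if self._overloaded():
            # Tell the client to back off instead of letting it time out.
            resp = webob.Response(status=503)
            resp.headers['Retry-After'] = str(self.retry_after)
            return resp
        return req.get_response(self.application)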
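
And for a sense of why point 4 (review 65784) should help: every call made
with run_as_root=True pays for a sudo round trip plus starting the rootwrap
Python interpreter, and nova-network issues many small read-only commands
on these saturated nodes. The snippet below is a hypothetical before/after
sketch of that kind of change, assuming nova's utils.execute helper; it is
not the actual diff from the review.

from nova import utils


def device_exists(device):
    """Check whether a network device exists (illustrative sketch)."""
    # Before: every probe paid the sudo + rootwrap overhead.
    #   _, err = utils.execute('ip', 'link', 'show', 'dev', device,
    #                          run_as_root=True, check_exit_code=False)
    # After: 'ip link show' only reads state, so it needs no elevated
    # privileges and skips rootwrap entirely.
    _, err = utils.execute('ip', 'link', 'show', 'dev', device,
                           check_exit_code=False)
    return not err

Read-only probes like this are exactly the "cases where it wasn't
necessary" that the quoted message describes, since the command inspects
state without modifying anything.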