+1. Very interesting to read about these bottlenecks, and very grateful they are being addressed.
Sent from my really tiny device...

> On Jan 11, 2014, at 8:44 AM, "Sean Dague" <s...@dague.net> wrote:
>
> First, thanks a ton for diving in on all this Russell. The big push by the
> Nova team recently is really helpful.
>
>> On 01/11/2014 09:57 AM, Russell Bryant wrote:
>>> On 01/09/2014 04:16 PM, Russell Bryant wrote:
>>>> On 01/08/2014 05:53 PM, Joe Gordon wrote:
>>>> Hi All,
>>>>
>>>> As you know the gate has been in particularly bad shape (gate queue over
>>>> 100!) this week due to a number of factors. One factor is how many major
>>>> outstanding bugs we have in the gate. Below is a list of the top 4 open
>>>> gate bugs.
>>>>
>>>> Here are some fun facts about this list:
>>>> * All bugs have been open for over a month
>>>> * All are nova bugs
>>>> * These 4 bugs alone were hit 588 times which averages to 42 hits per
>>>> day (data is over two weeks)!
>>>>
>>>> If we want the gate queue to drop and not have to continuously run
>>>> 'recheck bug x' we need to fix these bugs. So I'm looking for
>>>> volunteers to help debug and fix these bugs.
>>>
>>> I created the following etherpad to help track the most important Nova
>>> gate bugs, who is actively working on them, and any patches that we have
>>> in flight to help address them:
>>>
>>> https://etherpad.openstack.org/p/nova-gate-issue-tracking
>>>
>>> Please jump in if you can. We shouldn't wait for the gate bug day to
>>> move on these. Even if others are already looking at a bug, feel free
>>> to do the same. We need multiple sets of eyes on each of these issues.
>>
>> Some good progress from the last few days:
>>
>> After looking at a lot of failures, we determined that the vast majority
>> of failures are performance related. The load being put on the
>> OpenStack deployment is just too high. We're working to address this to
>> make the gate more reliable in a number of ways.
>>
>> 1) (merged) https://review.openstack.org/#/c/65760/
>>
>> The large-ops test was cut back from spawning 100 instances to 50. From
>> the commit message:
>>
>> It turns out the variance in cloud instances is very high, especially
>> when comparing different cloud providers and regions. This test was
>> originally added as a regression test for the nova-network issues with
>> rootwrap. At which time this test wouldn't pass for 30 instances. So
>> 50 is still a valid regression test.
>>
>> 2) (merged) https://review.openstack.org/#/c/45766/
>>
>> nova-compute is able to do work in parallel very well. nova-conductor
>> cannot by default due to the details of our use of eventlet + how we
>> talk to MySQL. The way you allow nova-conductor to do its work in
>> parallel is by running multiple conductor workers. We had not enabled
>> this by default in devstack, so our 4 vCPU test nodes were only using a
>> single conductor worker. They now use 4 conductor workers.
>>
>> 3) (still testing) https://review.openstack.org/#/c/65805/
>>
>> Right now when tempest runs in the devstack-gate jobs, it runs with
>> concurrency=4 (run 4 tests at once). Unfortunately, it appears that
>> this maxes out the deployment and results in timeouts (usually network
>> related).
>>
>> This patch changes tempest concurrency to 2 instead of 4. The initial
>> results are quite promising. The tests have been passing reliably so
>> far, but we're going to continue to recheck this for a while longer for
>> more data.
>>
>> One very interesting observation on this came from Jim where he said "A
>> quick glance suggests 1.2x -- 1.4x change in runtime." If the
>> deployment were *not* being maxed out, we would expect this change to
>> result in much closer to a 2x runtime increase.
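
A back-of-the-envelope model makes that reasoning concrete. The sketch
below is a toy calculation with invented per-test numbers (nothing in it
is measured from tempest or the gate): if each test splits its time
between client-side work that parallelizes cleanly and waiting on a
backend that is already saturated, halving the tempest concurrency moves
the wall-clock time far less than 2x.

# Toy model only -- the 1.0s/0.75s split is invented for illustration,
# not measured from the gate.
def wall_time(n_tests, concurrency, client_secs, backend_secs):
    # Client-side work scales down with concurrency; a saturated backend
    # effectively serializes its share no matter how many tests run.
    return n_tests * client_secs / concurrency + n_tests * backend_secs

# If nothing were contended, dropping concurrency from 4 to 2 would
# roughly double the runtime:
print(wall_time(100, 2, 1.0, 0.0) / wall_time(100, 4, 1.0, 0.0))   # 2.0

# With a saturated backend eating a fixed share per test, the ratio
# shrinks into the 1.2x -- 1.4x range Jim observed:
print(wall_time(100, 2, 1.0, 0.75) / wall_time(100, 4, 1.0, 0.75)) # 1.25

In other words, a small slowdown from dropping concurrency is itself a
hint that the deployment, not the test runner, is the bottleneck.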
>
> We could also address this by locally turning up timeouts on operations that
> are timing out, which would let those things take the time they need.
>
> Before dropping the concurrency I'd really like to make sure we can point to
> specific fails that we think will go away. There was a lot of speculation
> around nova-network, however the nova-network timeout errors only pop up in
> elastic search on large-ops jobs, not normal tempest jobs. Definitely making
> OpenStack more idle will make more tests pass. The Neutron team has
> experienced that.
>
> It would be a ton better if we could actually feed back a 503 with a retry
> time (which I realize is a ton of work).
>
> Because if we decide we're now always pinned to only 2way, we have to start
> doing some major rethinking on our test strategy, as we'll be way outside the
> soft 45min time budget we've been trying to operate on. We'd actually been
> planning on going up to 8way, but were waiting for some issues to get fixed
> before we did that. It would sort of immediately put a moratorium on new
> tests. If that's what we need to do, that's what we need to do, but we should
> talk it through.
>
>> 4) (approved, not yet merged) https://review.openstack.org/#/c/65784/
>>
>> nova-network seems to be the largest bottleneck in terms of performance
>> problems when nova is maxed out on these test nodes. This patch is one
>> quick speedup we can make by not using rootwrap in a few cases where it
>> wasn't necessary. These really add up.
>>
>> 5) https://review.openstack.org/#/c/65989/
>>
>> This patch isn't a candidate for merging, but was written to test the
>> theory that by updating nova-network to use conductor instead of direct
>> database access, nova-network will be able to do work in parallel better
>> than it does today, just as we have observed with nova-compute.
>>
>> Dan's initial test results from this are **very** promising. Initial
>> testing showed a 20% speedup in runtime and a 33% decrease in CPU
>> consumption by nova-network.
>>
>> Doing this properly will not be quick, but I'm hopeful that we can
>> complete it by the Icehouse release. We will need to convert
>> nova-network to use Nova's object model. Much of this work is starting
>> to catch nova-network up on work that we've been doing in the rest of
>> the tree but have passed on doing for nova-network due to nova-network
>> being in a freeze.
>
> I'm a huge +1 on fixing this in nova-network.
>
>> 6) (no patch yet)
>>
>> We haven't had time to dive too deep into this yet, but we would also
>> like to revisit our locking usage and how it is affecting nova-network
>> performance. There may be some more significant improvements we can
>> make there.
>>
>>
>> Final notes:
>>
>> I am hopeful that by addressing these performance issues both in Nova's
>> code, as well as by turning down the test load, that we will see a
>> significant increase in gate reliability in the near future. I
>> apologize on behalf of the Nova team for Nova's contribution to gate
>> instability.
>>
>> *Thank you* to everyone who has been helping out!
>
> Yes, thanks much to everyone here.
>
> -Sean
>
> --
> Sean Dague
> Samsung Research America
> s...@dague.net / sean.da...@samsung.com
> http://dague.net
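
On Sean's point about feeding back a 503 with a retry time: the response
mechanics are the small part of that work, and a rough sketch could look
like the hypothetical WSGI middleware below. This is not an existing Nova
or Tempest feature; it assumes webob, which the OpenStack API services
already build on, and the hard part Sean alludes to (knowing reliably when
the service is overloaded and what retry interval to advertise) is glossed
over by the placeholder _overloaded() hook.

import webob
import webob.dec


class RetryAfterMiddleware(object):
    """Hypothetical sketch: answer 503 + Retry-After while overloaded."""

    def __init__(self, application, retry_after=5):
        self.application = application
        self.retry_after = retry_after

    def _overloaded(self):
        # Placeholder: a real implementation needs an actual load signal
        # (RPC queue depth, DB pool saturation, conductor backlog, ...).
        return False

    @webob.dec.wsgify
    def __call__(self, req):
        if self._overloaded():
            # Tell the client to back off instead of letting it time out.
            resp = webob.Response(status=503)
            resp.headers['Retry-After'] = str(self.retry_after)
            return resp
        return req.get_response(self.application)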
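
And for a sense of why point 4 (review 65784) should help: every call made
with run_as_root=True pays for a sudo round trip plus starting the rootwrap
Python interpreter, and nova-network issues many small read-only commands
on these saturated nodes. The snippet below is a hypothetical before/after
sketch of that kind of change, assuming nova's utils.execute helper; it is
not the actual diff from the review.

from nova import utils


def device_exists(device):
    """Check whether a network device exists (illustrative sketch)."""
    # Before: every probe paid the sudo + rootwrap overhead.
    #   _, err = utils.execute('ip', 'link', 'show', 'dev', device,
    #                          run_as_root=True, check_exit_code=False)
    # After: 'ip link show' only reads state, so it needs no elevated
    # privileges and skips rootwrap entirely.
    _, err = utils.execute('ip', 'link', 'show', 'dev', device,
                           check_exit_code=False)
    return not err

Read-only probes like this are exactly the "cases where it wasn't
necessary" that the quoted message describes, since the command inspects
state without modifying anything.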