On 2013-08-13 16:39, Clark Boylan wrote:
On Tue, Aug 13, 2013 at 1:25 PM, Matthew Treinish
<mtrein...@kortar.org> wrote:
Hi everyone,
So for the past month or so I've been working on getting tempest to
work stably with testr in parallel. As part of this you may have
noticed the testr-full jobs that get run on the zuul check queue. I was
using that job to debug some of the more obvious race conditions and
stability issues with running tempest in parallel. After a bunch of
fixes to tempest, and after finding some real bugs in some of the
projects, things seem to have smoothed out.
So I pushed the testr-full run to the gate queue earlier today. I'll be
keeping track of the success rate of this job vs. the serial job and
will use that as the determining factor before we push this live as the
default for all tempest runs. Assuming the success rate matches up well
enough with the serial job on the gate queue, I will push out the
change that migrates all the voting jobs to run in parallel, hopefully
either Friday afternoon or early next week. Also, if anyone has input
on what threshold they feel is good enough for this, I'd welcome it.
For example, do we want to ensure a >= 1:1 match for job success? Or
would something like 90% as stable as the serial job be good enough,
considering the speed advantage? (The parallel runs take about half as
much time as a full serial run; the parallel job normally finishes in
~25-30min.) Since this affects almost every project I don't want to
define this threshold without input from everyone.
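As a rough illustration of the threshold question above, a minimal
Python sketch could compare the two jobs' success rates against a
relative threshold. The function name, the run counts, and the 0.9
default are assumptions made up for illustration; they are not measured
gate data or part of any existing tooling.

```python
def parallel_is_stable_enough(serial_pass, serial_total,
                              parallel_pass, parallel_total,
                              relative_threshold=0.9):
    """Return True if the parallel job's success rate is at least
    relative_threshold times the serial job's success rate.

    relative_threshold=1.0 corresponds to the ">= 1:1 match" option,
    0.9 to the "90% as stable" option.
    """
    if serial_total <= 0 or parallel_total <= 0:
        raise ValueError("need at least one run of each job")
    serial_rate = serial_pass / serial_total
    parallel_rate = parallel_pass / parallel_total
    return parallel_rate >= relative_threshold * serial_rate


# Hypothetical counts, for illustration only.
ok_at_90_percent = parallel_is_stable_enough(95, 100, 88, 100)
ok_at_one_to_one = parallel_is_stable_enough(95, 100, 88, 100,
                                             relative_threshold=1.0)
```

With these made-up counts the parallel job would clear the 90% bar but
not the 1:1 bar, which is exactly the gap the threshold choice decides.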
After there is some more data for the gate queue's parallel job I'll
have some pretty graphite graphs that I can share comparing the success
trends between the parallel and serial jobs.
So at this point we're in the home stretch and I'm asking for
everyone's help in getting this merged. If you are reviewing or pushing
commits, please watch the results from these non-voting jobs. If things
fail on the parallel job but not the serial job, please investigate the
failure and open a bug if necessary. If it turns out to be a bug in
tempest please link it against this blueprint:
https://blueprints.launchpad.net/tempest/+spec/speed-up-tempest
so that I can give it the attention it deserves. I'd hate to get this
close to merging and have a bit of racy code land at the last second
and block us for another week or two.
I feel that we need to get this in before the H3 rush starts up as it
will help everyone get through the extra review load faster.
Getting this in before the H3 rush would be very helpful. When we made
the switch with Nova's unittests we fixed as many of the test bugs
as we could find, merged the change to switch the test runner, then
treated all failures as very high priority bugs that received
immediate attention. Getting this in before H3 will give everyone a
little more time to debug any potential new issues exposed by Jenkins
or people running the tests locally.
I think we should be bold here and merge this as soon as we have good
numbers that indicate the trend is for these tests to pass. Graphite
can give us the pass to fail ratios over time; as long as these trends
are similar for both the old nosetest jobs and the new testr job I say
we go for it. (Disclaimer: most of the projects I work on are not
affected by the tempest jobs; however, I am often called upon to help
sort out issues in the gate.)
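To make "similar trends" a little more concrete, here is a minimal
sketch of one way to compare pass ratios window by window. It assumes
the job results have already been exported as simple (window, passed)
pairs; that layout, the function names, and the 0.05 tolerance are
illustrative assumptions and do not reflect Graphite's own data format
or API.

```python
from collections import defaultdict


def pass_ratio_by_window(results):
    """Map each window label (e.g. a date string) to the fraction of
    runs in that window that passed.

    results: iterable of (window_label, passed) pairs.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for window, ok in results:
        total[window] += 1
        if ok:
            passed[window] += 1
    return {w: passed[w] / total[w] for w in total}


def trends_are_similar(old_ratios, new_ratios, tolerance=0.05):
    """True if, in every window where both jobs have data, the two
    pass ratios differ by at most tolerance."""
    shared = set(old_ratios) & set(new_ratios)
    if not shared:
        raise ValueError("no overlapping windows to compare")
    return all(abs(old_ratios[w] - new_ratios[w]) <= tolerance
               for w in shared)


# Hypothetical results, for illustration only.
nose_runs = [("2013-08-13", True)] * 9 + [("2013-08-13", False)]
testr_runs = [("2013-08-13", True)] * 18 + [("2013-08-13", False)] * 2
similar = trends_are_similar(pass_ratio_by_window(nose_runs),
                             pass_ratio_by_window(testr_runs))
```

Comparing per window rather than over the whole period keeps one bad
day on either job from being averaged away.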
I'm inclined to agree. It's not as if we don't have transient failures
now, and if we're looking at a 50% speedup in recheck/verify times then
as long as the new version isn't significantly less stable it should be
a net improvement.
Of course, without hard numbers we're kind of discussing in a vacuum
here.
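For what it's worth, the trade-off can be sketched with a very simple
model. This assumes each run passes independently with a fixed
probability and that a failed run is simply rechecked until it passes,
so the expected wall-clock time to a passing result is the run time
divided by the success rate. The 55-minute serial figure is inferred
from "about half as much time" and the 0.9 success rate is invented;
neither is a measured number.

```python
def expected_minutes_to_pass(run_minutes, success_rate):
    """Expected total minutes until a run passes, assuming independent
    retries with a fixed success rate (geometric distribution)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return run_minutes / success_rate


def break_even_success_rate(serial_minutes, serial_rate,
                            parallel_minutes):
    """Lowest parallel success rate at which the parallel job is, on
    average, no slower than the serial job to reach a passing run."""
    return parallel_minutes * serial_rate / serial_minutes


# Hypothetical inputs: serial ~55 min, parallel ~27.5 min, and an
# assumed (not measured) 90% serial success rate.
threshold = break_even_success_rate(55, 0.9, 27.5)
```

This only models time lost to rechecks; it ignores the cost of people
chasing down spurious failures, so the real bar should sit well above
the break-even value.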
-Ben
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev