Hi,
There seems to have been a lot of interest in testing lately, and it
occurred to me that most of you don't have much of an idea of how we
test, why we test that way, and how it's worked out for us.
A little history...
When I first started this project back in 1998, I put out several
releases in a row which were quite good. Then I put out something like
2 or 3 releases in 6 hours - to fix problems I should have caught, but
didn't, because the manual testing needed was so onerous that I hadn't
done enough of it :-(.
I have little tolerance for this kind of thing in general, and even less
when I do it ;-). So I spent the next several weeks writing CTS - the
cluster testing suite - to automate much of the testing.
CTS is a random test suite - that is, it has a battery of tests it knows
how to do, and it runs them in a random order with appropriate random
parameters.
This kind of testing is essential for clusters, because some of the
nastiest problems you run up against in this kind of situation are
timing issues. Timing between processes on a server, and timing between
servers.
A given test usually succeeds. If it is run 100 times, it's not
uncommon for it to fail only once, or maybe 5 times. Sometimes this is
because of interactions with the sequence of tests before it, but more
often such failures occur because the test doesn't run with the same
timing each time.
It also turns out that some clusters tend to create more timing problems
than others. For example, if you have a cluster of identical fast
machines with lots of RAM, you may run 5000 CTS iterations flawlessly
(this takes several days to a week). Put the same software on an
asymmetric cluster with widely divergent amounts of RAM and CPU speeds,
and it is not unheard of to see it fail within minutes. We have seen
this happen many times.
This is not completely surprising, since the random nature of the tests
combined with the divergent speeds tends to be much better at generating
timing problems. Although people rarely put together clusters with so
diverse a set of machines, it is not at all uncommon to see a number of
machines in a cluster under radically different loads. Indeed, the
typical "n+1" sparing guarantees an idle machine in the cluster,
generating an effect similar to that of a diverse cluster when some
machines are under heavy load.
We have access to 4 or 5 different clusters when we test. Most of them
are symmetric clusters of fast Intel machines. However, two stand out
as being especially valuable at finding timing problems.
These are:
 - a cluster of six System p (PowerPC) virtual machines, each
   getting 0.1 of a CPU. This cluster produces noticeably more
   problems than the fast clusters. It is behind an IBM firewall,
   and is unfortunately only accessible to IBMers.
 - a random collection of mostly old cast-off computers. They range
   from 300 MHz to 2.4 GHz, and have disks ranging from 5200 to
   7200 RPM with significantly different seek times. Some support
   DMA, I believe some don't, and the amount of RAM also varies.
This particular cluster is head-and-shoulders above "normal"
clusters in finding timing problems. On a few occasions
it wouldn't even run 10 iterations without several problems
popping up - on a version where another cluster had run
well over 5000 tests flawlessly. When this first happened
there were quite a few harsh words exchanged, until we
figured out that it wasn't the diligence of the tester,
it was just the cluster doing the testing.
In our experience, if our tests run flawlessly on these two clusters,
they will run flawlessly on the "good" clusters.
As a result, I have put a good bit of my time and some of my own money
into keeping this crappy old cluster going, and into paying for the
power to keep it on - because it's incredibly valuable to the project.
For reasons we don't completely understand, it has a (so far) unique
role in finding timing problems. All the developers on the project have
easy access to this cluster; just check with me to make sure I'm not
using it.
Obviously, timing problems aren't the only problems one encounters, but
they crop up during the testing of essentially every release. Sometimes
they're new, and sometimes things have changed to make them more likely
to happen. And, occasionally, we just seem to get (un-)lucky for a
particular run of the tests.
To summarize:
 - Every official project release passes through these two clusters.
 - Any major set of changes which hasn't done so probably has
   undiscovered timing problems in it.
 - Of course, we don't claim to find them all. So far this seems to
   have been very effective at keeping them out of _your_ clusters ;-).
 - When a project partner works with us on their release schedule, we
   try hard to put out a project release - tested by these two
   cantankerous clusters - just before their release. Of course, all
   developers have access to this cluster on their own.
Hope this background helps people understand a little better about how
we test.
--
Alan Robertson <[EMAIL PROTECTED]>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems