On 7/15/07, Alan Robertson <[EMAIL PROTECTED]> wrote:

Hi,

There seems to have been a lot of interest in testing lately, and it
occurred to me that most of you don't have much of an idea of how we
test, why we test that way, and how it's worked out for us.

A little history...
When I first started this project back in 1998, I put out several
releases in a row which were quite good.  Then, I put out something like
2 or 3 releases in 6 hours - to fix problems that I should have caught
but didn't, because the manual testing needed was so onerous that I
hadn't done enough of it :-(.

I have little tolerance for this kind of thing in general, and even less
when I do it ;-).  So, I spent the next several weeks writing CTS - the
cluster testing suite, so that I could automate much of the testing.

CTS is a random test suite - that is, it has a battery of tests it knows
how to do, and it runs them in a random order with appropriate random
parameters.

This kind of testing is essential for clusters, because some of the
nastiest problems you run up against in this kind of situation are
timing issues.  Timing between processes on a server, and timing between
servers.

Usually a given test succeeds most of the time it runs.  If it is run
100 times, it's not uncommon for it to fail only once, or maybe only 5
times.  Sometimes this is because of interactions with the sequence of
tests in front of it, but more often, failures like this occur because
it doesn't run with the same timing each time.

It also turns out that some clusters tend to create more timing problems
than others.  For example, if you have a cluster of identical fast
machines with lots of RAM, you may run 5000 CTS iterations
flawlessly (this takes several days to a week).  If you put this same
software on an asymmetric cluster with a wide divergence between amount
of RAM and CPU speeds, then it is not unheard of to see it fail in
minutes.

This is a phenomenon which we have seen happen many times.

This is not completely surprising, since the random nature of the tests
combined with the divergent speeds tends to be much better at generating
timing problems.  Although people rarely put together clusters with so
diverse a set of machines, it is not at all uncommon to see a number of
machines in a cluster under radically different loads.  Indeed, the
typical "n+1" sparing guarantees an idle machine in the cluster,
generating an effect similar to that of a diverse cluster when some
machines are under heavy load.

We have access to 4 or 5 different clusters for testing.  Most of them
are symmetric clusters of fast Intel machines.  However, two stand out
as being especially valuable at finding timing problems.

These are:
    a 6-node cluster consisting of 6 System p (Power PC) virtual
        machines - each getting 0.1 of a CPU.  This cluster
        produces noticeably more problems than the fast clusters.
        This cluster is behind an IBM firewall, and is unfortunately
        only accessible to IBMers.

    a random collection of mostly old cast-off computers.  They range
        between 300 MHz and 2.4 GHz, and have disks ranging from
        5400 to 7200 RPM with significantly different seek times.
        Some support DMA, and I believe some of them don't, and the
        amount of RAM also varies.

        This particular cluster is head-and-shoulders above "normal"
        clusters in finding timing problems.  On a few occasions
        it wouldn't even run 10 iterations without several problems
        popping up - on a version where another cluster had run
        well over 5000 tests flawlessly.  When this first happened
        there were quite a few harsh words exchanged, until we
        figured out that it wasn't the diligence of the tester,
        it was just the cluster doing the testing.

In our experience, if our tests run flawlessly on these two clusters,
they will run flawlessly on the "good" clusters.

As a result, I put a good bit of my time and some of my money into
keeping this crappy old cluster going, and paying for the power to keep
it on - because it's incredibly valuable to the project.  For reasons we
don't completely understand, it has a (so far) unique role in finding
timing problems.  All the developers on the project have easy access to
this cluster, just check with me to make sure I'm not using it.

Obviously, timing problems aren't the only problems one encounters, but
they crop up during the testing of essentially every release.  Sometimes
they're new, and sometimes things have changed to make them more likely
to happen.  And, occasionally, we just seem to get (un-)lucky for a
particular run of the tests.

To summarize:

        Every official project release passes through these two clusters

        Any major set of changes which hasn't done so probably
                has undiscovered timing problems in it.
                Of course, we don't claim to find them all.
                So far this seems to have been very effective at
                keeping them out of _your_ clusters ;-).

        When a project partner works with us on their release schedule,
                we try hard to put out a project release just before
                their release - that's been tested by these two
                cantankerous clusters.  Of course, all developers
                have access to this cluster on their own.

Hope this background helps people understand a little better how we
test.


Let's call a spade a spade, shall we...

This is a thinly veiled put-down of the people who have been doing Alan's
job for the last 7 months.


As one of those people, I don't take kindly to the implication that I
have been releasing substandard packages.  Particularly given that I am
the author of the majority of the HAv2 code and therefore have arguably
the most interest in its quality.



Alan's attempts to reassert control over a project he routinely ignores are
almost always at the expense of the reputation and hard work of others.
This, like the recent threats of censorship, is something I will no longer
tolerate.



While it is true that in the past Alan's home cluster threw up some
interesting bugs, it is also true that it has not been the unique source of
a bug in over a year.  There are now a number of other clusters out there
that do a better job at finding problems.



I'll not comment on the statements about release planning, which are
laughable, if not outright insulting.


Instead I will simply point people in the direction of:

  http://wiki.linux-ha.org/RoadMap?action=info

and

  http://hg.linux-ha.org/dev/log/tip/MasterPlan.planner

and let you draw your own conclusions.



Had I the choice, I would gladly leave putting out releases to others.


However I see no reason to trust Alan with the task of providing timely
access to bug fixes for users of most major distributions.  I no longer have
confidence in his leadership of the project and it remains to be seen how
long his renewed interest will last.


So I will continue to provide well tested, high quality package updates and
people can make their own choice.


And please don't forget
http://software.opensuse.org/download/home:/LarsMB/ for those people
with the time and inclination to help us track down bugs before they
are released.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems