On Jan 19, 2006, at 9:57 PM, Lars Marowsky-Bree wrote:
On 2006-01-19T20:17:58, Peter Kruse <[EMAIL PROTECTED]> wrote:
The problem I observe only manifested after the resources had been
online for about 24h, with one or two resource groups with some
resources defined in them. So I'm not sure that the tests you run
really are "real life", so to say. The resource agents really put some
stress on the CIB, as they run crm_attribute on every monitor action;
that's about 10 RAs calling crm_attribute every 30 seconds.
Woah, what are you calling crm_attribute for all the time?
It's either an ipfail replacement or his way of getting resources to
run on the node where they've failed the least... I forget which.
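For anyone curious, a monitor action doing this might look roughly
like the sketch below; the attribute name, the health-check helper and
crm_attribute's long options - as spelled in later Pacemaker releases -
are illustrative, not taken from Peter's actual agents:

  # Hypothetical fragment of an OCF resource agent's monitor action.
  # Each monitor run pushes one attribute update into the CIB, so ten
  # such agents on 30s intervals generate constant CIB traffic.
  . ${OCF_ROOT:-/usr/lib/ocf}/resource.d/heartbeat/.ocf-shellfuncs

  my_monitor() {
      # check_service_health is a made-up stand-in for the RA's real check
      check_service_health || return $OCF_NOT_RUNNING
      crm_attribute --name "health_$OCF_RESOURCE_INSTANCE" --update 1 \
          || return $OCF_ERR_GENERIC
      return $OCF_SUCCESS
  }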
This results in the message
"Processing cib_query operation from ..." occurring in syslog about
every second.
Yeah, we know, logging needs tuning. This one probably needs to be
tuned down.
Nod. Not logging read-only CIB calls wouldn't affect me too much.
I have two installations running a CVS revision from 18.1.2005,
running until now without problems - knock on wood.
2005-01-18? Woah. We should have just not done the last year then ;-)
Or, glass half-full: look where we are, and it's still only Jan 2005!
SLES10 is ages away!
Please let me stress this further: Your tests are important to see if
your code is reliable. Unfortunately they don't seem to be enough. I
don't want to get into the discussion of "you cannot test everything".
That is granted.
No, you're of course right, and we need to improve our testing effort
all the time. We'll do a lot more testing as we gear up for shipping
SLES10, for example, which includes more real-life testing by humans.
We try to add test cases for bugs which were fixed, so that we can
catch regressions, or similar bugs in other pieces of code.
Automated tests can never be a replacement for pilot deployments
though; they feed on them, but can't replace them ;-)
Agreed. CTS is great for ensuring a good level of reliability before
we put it in anyone's hands. It exercises the cluster in ways, and
with a frequency, we just can't match by hand. But there is no silver
bullet, and we need to remember that.
But it seems it would be good to run tests with more "real-life
examples" - and those for a longer period of time. If you have the
resources to set up a cluster with two physical machines and define
resource groups with - well, why not - all possible resources (nfs,
samba, drbd, ...), please do so.
Yes, long-term tests (i.e., how do we perform in the _stable_ case?)
are something annoyingly difficult to test with a test harness. Or
rather, it's perfectly possible, just very boring.
In one way I want to get on the "we can't test everything" soapbox though:
That's why your testing is so valuable to us. We'll never be able to
test what _you_ want, but we will be happy to fix the bugs you find...
The only possible reason (from the CRM side) to delay a release is if
we can find a root cause for Peter's CIB problems.
Yes, please. From my own experience, I'd prefer not to have to
consider problems I cannot reproduce either, sure. And I don't expect
you to take responsibility for resource agents not written by you.
Granted, they're doing some advanced things, but nothing unreasonable.
Certainly not something that should result in the sort of CPU usage
you're seeing.
Rest assured, I'm not sticking my head in the sand - I'm still trying
to figure out how we can get enough data on the problem to solve it.
But believe me, on _every_ installation we have made so far we had the
same problem - that is, lrmd reporting a timeout on one resource
agent, and heartbeat not being able to recover, which is ... well - bad.
So far I have tracked the problem down to one of the crm_attribute
calls taking too much time at one point.
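For what it's worth, a crude timing wrapper around each call is
enough to see which invocation stalls; a sketch only - the 5s
threshold and the logger tag are invented:

  # Log any crm_attribute invocation that takes suspiciously long.
  timed_crm_attribute() {
      local start end rc
      start=$(date +%s)
      crm_attribute "$@"
      rc=$?
      end=$(date +%s)
      if [ $(( end - start )) -ge 5 ]; then    # 5s threshold is arbitrary
          logger -t ra-timing "crm_attribute $* took $(( end - start ))s (rc=$rc)"
      fi
      return $rc
  }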
Hm. How annoying, but I'll leave that to Andrew to debug, I'm afraid;
he's the guru of that code.
A regression test which just pounds the CIB with queries from several
clients in parallel, however, seems a good idea. Andrew, if you're
bored, how about such a testcase? (We could add it to BSC, or at least
run it on demand there.)
Except it takes 24 hours of such pounding to trigger it... not really
feasible for CTS.
But I have a cluster that's been running for a few days, and apart
from a rather nasty memory leak in stonithd it's going well. So with
the CTS cluster running so smoothly, it might be time to mix it up a
bit and pound the "aged" one with CIB queries.
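Something as dumb as the following would probably do for a first cut;
this is only a sketch - the client count and per-client rate are
guesses at matching Peter's load, and it leans on cibadmin's
read-only -Q (query) option:

  #!/bin/sh
  # Crude CIB stress test: N parallel clients issuing read-only CIB
  # queries in a loop, to mimic many RAs hitting the CIB at once.
  CLIENTS=10        # roughly Peter's 10 resource agents
  DURATION=86400    # 24h, since that's how long the problem takes to show

  end=$(( $(date +%s) + DURATION ))
  i=0
  while [ $i -lt $CLIENTS ]; do
      (
          while [ "$(date +%s)" -lt "$end" ]; do
              cibadmin -Q >/dev/null 2>&1 || logger -t cib-pound "query failed"
              sleep 1                     # per-client rate is a guess
          done
      ) &
      i=$((i + 1))
  done
  wait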
As I'm not a coder, it's not easy for me to understand the details of
heartbeat, but I'm willing to, and going to, help make heartbeat _the_
open source HA software available.
Great!
Well, and if you're using this for commercial deployments with the
intention of making money, I of course have to pitch using SLES,
because that keeps us paid; but if you invest a modest amount of money
into some student's diploma thesis or practica on using/developing
Linux HA, you'll make her/him, yourself and the Linux HA project very
happy ;-)
Sincerely,
Lars Marowsky-Brée
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
    -- Charles Darwin
--
Andrew Beekhof
"Would the last person to leave please turn out the enlightenment?" -
TISM
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/