On Jan 19, 2006, at 9:57 PM, Lars Marowsky-Bree wrote:
On 2006-01-19T20:17:58, Peter Kruse <[EMAIL PROTECTED]> wrote:
The problem I observe only manifested after the resources had been
online for about 24h, with one or two resource groups with some
resources defined in them. So I'm not sure that the tests you run
really are "real life", so to say. The resource agents really put some
stress on the CIB, as they run crm_attribute on every monitor action;
that's about 10 RAs calling crm_attribute every 30 seconds.
Woah, what are you calling crm_attribute for all the time?
It's either an ipfail replacement or his way of getting resources to
run on the node where they've failed the least... I forget which.
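For anyone curious, a monitor action doing this might look roughly
like the sketch below; the attribute name, the health-check helper and
crm_attribute's long options - as spelled in later Pacemaker releases -
are illustrative, not taken from Peter's actual agents:

  # Hypothetical fragment of an OCF resource agent's monitor action.
  # Each monitor run pushes one attribute update into the CIB, so ten
  # such agents on 30s intervals generate constant CIB traffic.
  . ${OCF_ROOT:-/usr/lib/ocf}/resource.d/heartbeat/.ocf-shellfuncs

  my_monitor() {
      # check_service_health is a made-up stand-in for the RA's real check
      check_service_health || return $OCF_NOT_RUNNING
      crm_attribute --name "health_$OCF_RESOURCE_INSTANCE" --update 1 \
          || return $OCF_ERR_GENERIC
      return $OCF_SUCCESS
  }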
This results in the message
"Processing cib_query operation from ..." occurring in syslog about
every second.
Yeah, we know, logging needs tuning. This one probably needs to be
tuned down.
Nod. Not logging read-only CIB calls wouldn't affect me too much.
I have two installations running a CVS revision from 18.1.2005,
running until now without problems - knock on wood.
2005-01-18? Woah. We should have just not done the last year then ;-)
Or, glass half-full: look where we are, and it's still only Jan 2005!
SLES10 is ages away!
Please let me stress this further: Your tests are important to see if
your code is reliable. Unfortunately they don't seem to be enough. I
don't want to get into the discussion of "you cannot test everything".
That is granted.
No, you're of course right, and we need to improve our testing effort
all the time. We'll do a lot more testing as we gear up for shipping
SLES10, for example, which includes more real-life testing by humans.
We try to add test cases for bugs which were fixed, so that we can
catch regressions, or similar bugs in other pieces of code.
Automated tests can never be a replacement for pilot deployments
though; they feed on them, but can't replace them ;-)
Agreed. CTS is great for ensuring a good level of reliability before
we put it in anyone's hands. It exercises the cluster in ways, and
with a frequency, we just can't match by hand. But there is no silver
bullet, and we need to remember that.
But it seems it would be good to run tests with more "real-life
examples" - and those for a longer period of time. If you have the
resources to set up a cluster with two physical machines and define
resource groups with - well, why not - all possible resources (nfs,
samba, drbd, ...), please do so.
Yes, long-term tests (i.e., how do we perform in the _stable_ case?)
are something annoyingly difficult to test with a test harness. Or
rather, it's perfectly possible, just very boring.
In one way I want to get on the "we can't test everything" soapbox though:
That's why your testing is so valuable to us. We'll never be able to
test what _you_ want, but we will be happy to fix the bugs you find...
The only possible reason (from the CRM side) to delay a release is if
we can find a root cause for Peter's CIB problems.
Yes, please. From my own experience, I'd prefer not to have to
consider problems I cannot reproduce either, sure. And I don't expect
you to take responsibility for resource agents not written by you.
Granted, they're doing some advanced things, but nothing unreasonable.
Certainly not something that should result in the sort of CPU usage
you're seeing.
Rest assured, I'm not sticking my head in the sand - I'm still trying
to figure out how we can get enough data on the problem to solve it.
But believe me, on _every_ installation we have made so far we had the
same problem - that is, lrmd reporting a timeout on one resource
agent, and heartbeat not being able to recover, which is ... well - bad.
So far I have tracked the problem down to one of the crm_attribute
calls taking too much time at one point.
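For what it's worth, a crude timing wrapper around each call is
enough to see which invocation stalls; a sketch only - the 5s
threshold and the logger tag are invented:

  # Log any crm_attribute invocation that takes suspiciously long.
  timed_crm_attribute() {
      local start end rc
      start=$(date +%s)
      crm_attribute "$@"
      rc=$?
      end=$(date +%s)
      if [ $(( end - start )) -ge 5 ]; then    # 5s threshold is arbitrary
          logger -t ra-timing "crm_attribute $* took $(( end - start ))s (rc=$rc)"
      fi
      return $rc
  }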
Hm. How annoying, but I'll leave that to Andrew to debug, I'm afraid;
he's the guru of that code.
A regression test which just pounds the CIB with queries from several
clients in parallel, however, seems a good idea. Andrew, if you're
bored, how about such a testcase? (We could add it to BSC, or at least
run it on demand there.)
Except it takes 24 hours of such pounding to trigger it... not really
feasible for CTS.
But I have a cluster that's been running for a few days, and apart
from a rather nasty memory leak in stonithd it's going well. So with
the CTS cluster running so smoothly, it might be time to mix it up a
bit and pound the "aged" one with CIB queries.
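Something as dumb as the following would probably do for a first cut;
this is only a sketch - the client count and per-client rate are
guesses at matching Peter's load, and it leans on cibadmin's
read-only -Q (query) option:

  #!/bin/sh
  # Crude CIB stress test: N parallel clients issuing read-only CIB
  # queries in a loop, to mimic many RAs hitting the CIB at once.
  CLIENTS=10        # roughly Peter's 10 resource agents
  DURATION=86400    # 24h, since that's how long the problem takes to show

  end=$(( $(date +%s) + DURATION ))
  i=0
  while [ $i -lt $CLIENTS ]; do
      (
          while [ "$(date +%s)" -lt "$end" ]; do
              cibadmin -Q >/dev/null 2>&1 || logger -t cib-pound "query failed"
              sleep 1                     # per-client rate is a guess
          done
      ) &
      i=$((i + 1))
  done
  wait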
As I'm not a coder, it's not easy for me to understand the details of
heartbeat, but I'm willing to, and going to, help make heartbeat _the_
open source HA software available.
Great!
Well, and if you're using this for commercial deployments with the
intention of making money, I of course have to pitch using SLES,
because that keeps us paid; but if you invest a modest amount of money
into some student's diploma thesis or practica on using/developing
Linux HA, you'll make her/him, yourself and the Linux HA project very
happy ;-)
Sincerely,
Lars Marowsky-Brée
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
    -- Charles Darwin
--
Andrew Beekhof
"Would the last person to leave please turn out the enlightenment?" -
TISM
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/