On 14 October 2010 09:41, Lars Ellenberg <[email protected]> wrote:
> On Wed, Oct 13, 2010 at 10:42:33PM +0200, Lars Marowsky-Bree wrote:
>> On 2010-10-13T20:36:59, Steve Davies <[email protected]> wrote:
>>
>> > In particular I am interested to understand the meaning of the various
>> > sequence numbers and so forth, and their implications when hosts
>> > fail-over, die, return to active status etc. Basically the sort of
>> > thing you would find in a protocol RFC.
>> >
>> > Many thanks for any pointers.
>>
>> The heartbeat protocol and the CCM are not extremely well documented.
>>
>> For corosync, there's the protocol specification for totem at least. But
>> I guess most of the things would still be in code ;-)
>>
>> Regards,
>>     Lars
>>
>> --
>> Architect Storage/HA, OPS Engineering, Novell, Inc.
>
> Absolutely.
>
> So if you are going to spend time improving the cluster communications
> code, that time would better be spent understanding and improving
> corosync. There is enough work to do, (automatic recovery of) redundant
> rings, membership when starting with all cluster comm down to allow for
> a two-node tiebreaker and stonith of the other node to make progress,
> probably a few other interesting higher level issues, and certainly a
> few not-so-interesting janitor level things.
>
> Corosync (or, at least the algorithms it implements) are much better
> documented, or should we say: documented at all, besides reading the code.
>
> Even though I myself spent some time in the heartbeat ipc messaging and
> cluster communication layer lately, I'd not recommend anyone _starting_
> on this to do so. I sometimes have to, as that's part of being the
> appointed maintainer of the heartbeat stuff.
>
> If you have the choice, go understand corosync,
> and the algorithms involved, and improve it.
> Steve Dake would be the guy to ask for advice on corosync,
> I'm sure he won't be opposed to someone helping out with corosync
> maintenance and development.
>
> If you happen to be somehow target locked on heartbeat, tell us why,
> and what you are trying to achieve, and we figure something out.
>
> If you are "just" homing in on cluster communications,
> please go for corosync.
>
> Why should I say so, even if I currently still advocate the use of
> heartbeat/pacemaker over corosync/pacemaker in production setups?
>
> Because heartbeat is legacy. It works (most of the time), but actually
> no one really knows anymore how exactly, or why, it works.
>
> Corosync may not work as well as we would like it to in various
> scenarios. But at least we know what exactly it tries to do, and why,
> as the algorithms involved are documented.
> And thus time improving corosync, identifying and overcoming its
> limitations, would be time well spent.
>
> Whereas time figuring out what exactly heartbeat does, and why it may do
> it the way it does by reverse engineering the code, is probably not
> exactly wasted, but possibly close, sometimes, even though it may be an
> interesting and educating experience.
>
Hi,

I am somewhat committed to heartbeat because I cannot use pacemaker.
Pacemaker is "huge", not necessarily in its own right, but even a
minimal Python install is far too big for the 300 MB of O/S storage I
have available (yes, 300 megabytes!).

I assume (I've not checked) that, unlike heartbeat, corosync does not
have a built-in "minimal resource manager" that would let me run a
basic 2-node setup without pacemaker?

The heartbeat code is fairly well self-documenting, but it is sometimes
frustrating having to go and search for the documentation in the code
when I was initially unaware of the cluster-glue / heartbeat code
split :)

I'll fight on. At least I now know where to submit any changes if I
manage to do anything useful :)

Thanks,
Steve
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
