On Wed, Oct 13, 2010 at 10:42:33PM +0200, Lars Marowsky-Bree wrote:
> On 2010-10-13T20:36:59, Steve Davies <[email protected]> wrote:
>
> > In particular I am interested to understand the meaning of the various
> > sequence numbers and so forth, and their implications when hosts
> > fail-over, die, return to active status etc. Basically the sort of
> > thing you would find in a protocol RFC.
> >
> > Many thanks for any pointers.
>
> The heartbeat protocol and the CCM are not extremely well documented.
>
> For corosync, there's the protocol specification for totem at least. But
> I guess most of the things would still be in code ;-)
>
> Regards,
>     Lars
>
> --
> Architect Storage/HA, OPS Engineering, Novell, Inc.

Absolutely.

So if you are going to spend time improving the cluster communications
code, that time would be better spent understanding and improving
corosync.

There is enough work to do: (automatic recovery of) redundant rings,
membership when starting with all cluster comm down to allow for a
two-node tiebreaker and stonith of the other node to make progress,
probably a few other interesting higher level issues, and certainly a
few not-so-interesting janitor level things.

Corosync (or at least the algorithms it implements) is much better
documented, or should we say: documented at all, beyond reading the
code.

Even though I myself spent some time in the heartbeat ipc messaging and
cluster communication layer lately, I'd not recommend anyone _starting_
on this to do so. I sometimes have to, as that's part of being the
appointed maintainer of the heartbeat stuff.

If you have the choice, go understand corosync, and the algorithms
involved, and improve it. Steve Dake would be the guy to ask for advice
on corosync, I'm sure he won't be opposed to someone helping out with
corosync maintenance and development.

If you happen to be somehow target locked on heartbeat, tell us why,
and what you are trying to achieve, and we will figure something out.
If you are "just" homing in on cluster communications, please go for
corosync.

Why should I say so, even if I currently still advocate the use of
heartbeat/pacemaker over corosync/pacemaker in production setups?

Because heartbeat is legacy. It works (most of the time), but actually
no one really knows anymore how exactly, or why, it works.

Corosync may not work as well as we would like it to in various
scenarios. But at least we know what exactly it tries to do, and why,
as the algorithms involved are documented.

And thus time improving corosync, identifying and overcoming its
limitations, would be time well spent.
Whereas time spent figuring out what exactly heartbeat does, and why it
may do it the way it does, by reverse engineering the code, is probably
not exactly wasted, but possibly close, sometimes, even though it may
be an interesting and educational experience.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
