On Fri, Feb 24, 2012 at 10:26:42AM +0100, Andreas Kurz wrote:
> On Thu, Feb 23, 2012 at 08:59:44AM -0500, Tom Hanstra wrote:
>> On Wed, Feb 22, 2012 at 11:48:17PM +0100, Andreas Kurz wrote:
>>> > But is corosync better than heartbeat? Or am I getting into a
>>> > religious war by asking that?
>>>
>>> Since heartbeat is not actively developed any more, corosync is the
>>> way to go for a future proof setup.
>> Hmmm, this is something which I did not understand when starting to
>> look into this. If this is the case, it would be nice if the web pages
>> were updated accordingly.

> You mean linux-ha.com? ... yeah, that might be true. But looking at
> clusterlabs.org makes it quite clear that corosync is the way to go for
> new setups ... there are also some nice FAQs:
>
> http://clusterlabs.org/wiki/FAQ

My current take on heartbeat vs. corosync, even with the heartbeat
"steward and maintainer" hat on. What is relevant for Pacemaker clusters:

Heartbeat

 * Heartbeat is no longer actively developed, and it does not look like
   that will change.

 * Heartbeat *is* maintained, and will stay maintained for the
   foreseeable future.

 * Please use 3.0.5 (current mercurial), or you may get funny "busy
   loops" if you experience packet loss on the link.

 * It is limited in message size; the CIB can grow quite large. Once even
   the bz2-compressed CIB (including the status section) grows beyond the
   payload of a single UDP packet, Pacemaker on Heartbeat will break
   horribly.
   - That could be overcome, but it would mean development, and someone
     would need to do it. There would need to be a good motivation for
     that... I don't see it happening soon, or at all.

 * I know of a glib callback priority inversion, where heartbeat, in the
   presence of packet loss, may not recognize a "node dead" event because
   it is too busy requesting retransmits of the last lost packets...
   I think I have that fixed, but it is not yet in the repo. It should
   land "soon", and be released as 3.0.6.

 * It has strange behaviour if you ifconfig an interface down, then
   ifconfig it up again (I'm not talking about unplugging the cable or a
   switch going down, but really about setting the link down in Linux).
   It may take ~20 minutes before you can really use that interface
   again. I know why that is, and I may fix it too, but the short story
   is: "don't do that".
 * The membership algorithm (cluster consensus membership) is somewhat
   "ad hoc-ish", but very robust.

 * Pacemaker on Heartbeat handles "cluster partition merge" as well as it
   gets, or "as expected".

 * Heartbeat supports TIPC and other "exotic" protocols, for those
   interested in running Pacemaker on a TIPC stack.

Corosync

 * Corosync is actively developed, and in some cases can be a "moving
   target".

 * It has improved *a lot* in stability and features since 1.2/1.3.
   Current 1.4 is good.

 * The algorithm used is well understood and documented. I think it is
   much more sensitive to latency, packet loss, and timeouts, so you had
   better make sure your network matches the requirements even under
   heavy load and memory pressure. And configure your timeouts on the
   conservative side.

 * 2.0 will of course bring "all new and improved bugs"; that is the
   nature of development. But given all the momentum behind it, whatever
   issues crop up will be fixed very quickly.

 * It does not suffer from the single-UDP-message size limit; I think its
   message size limit is ~1 MByte (vs. <= 64k in heartbeat).

 * It is required for DLM/cLVM/cluster file systems and the like.

 * At this time, the only "very ugly behaviour" I know of with Pacemaker
   on Corosync is on "cluster partition merge": if nodes are declared
   dead but not fenced (because fencing is not enabled, not working, or
   not working fast enough), and then see each other again, corosync and
   pacemaker do not agree on membership, and the cluster does not
   recover. There is also at least one bugzilla on that:
   https://bugzilla.redhat.com/show_bug.cgi?id=752477
   It is unclear (to me) whether this is a shortcoming in the
   implementation of cluster partition merge in corosync, a bug in the
   pacemaker <-> corosync interaction, or both. I've seen a few commits
   in pacemaker lately that may be related. Pacemaker on heartbeat
   behaves as expected in the same situation. As long as you have tested
   and working fencing implemented, this should not affect you at all.
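On "configure your timeouts on the conservative side": for corosync that
advice lands in the totem section of corosync.conf. The values below are
example figures of mine, not recommendations from this thread; tune them
to your own network.

```
totem {
        version: 2

        # Time (ms) without seeing the token before declaring it lost.
        # Higher values tolerate latency spikes under load, at the cost
        # of slower failure detection. Example value only.
        token: 10000

        # How many token losses before a processor is declared failed.
        token_retransmits_before_loss_const: 20

        # Time (ms) to wait for consensus before starting a new
        # membership round; must be larger than token.
        consensus: 12000
}
```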
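As an aside on the message-size point above: a rough sketch (mine, not
from the original discussion) of how one might estimate whether a CIB
still fits in a single heartbeat message once bz2-compressed. The
60000-byte budget is an illustrative assumption, not an exact heartbeat
constant; in practice you would feed this the XML that `cibadmin -Q`
prints, rather than the toy document below.

```python
import bz2

# Assumed budget: a bit below the ~64k theoretical single-UDP-datagram
# payload, to leave room for heartbeat's own framing. Illustrative only.
UDP_PAYLOAD_BUDGET = 60000

def fits_in_one_message(cib_xml: bytes, budget: int = UDP_PAYLOAD_BUDGET) -> bool:
    """Return True if the bz2-compressed CIB fits within the budget."""
    return len(bz2.compress(cib_xml)) <= budget

# A toy CIB-like document; a real cluster's status section is far larger
# and far less compressible than this repetitive sample.
sample = (b"<cib><configuration/><status>"
          + b"<node_state/>" * 1000
          + b"</status></cib>")
print(fits_in_one_message(sample))
```

Highly repetitive XML compresses extremely well, so the toy sample fits
easily; a large, varied status section is what pushes a real CIB over
the limit.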
Portability

I don't know about the portability of corosync; I know that heartbeat
(used to be) portable to just about any unix-like thing out there.
That's probably only relevant to very few people, though.

My conclusion:

Those building small clusters (a handful of resources, a small number of
nodes), with no cluster file system and no DLM involved, and trying to
get away without stonith: please use heartbeat. Unless the mentioned
behaviour is fixed meanwhile...

Everyone else, **with tested and working stonith**, for new deployments:
use corosync >= 1.4.2.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
