Hi all,
this is to move forward a discussion that Alan, Andrew, Lars (Ellenberg)
and I have had numerous times in the past.
For those not yet in the know ;-), a RISB (hey, one can actually
pronounce that, it must be a good acronym!) occurs when heartbeat can
communicate with a different set of nodes than a resource that is active
on several nodes can. This can happen to essentially anything that does
not use our communication channels itself.
For example, a replication mechanism like drbd might lose its internal
connection, causing the replication to stop, while heartbeat carries on
unaffected. Or an OCFS2 instance might be unable to reach some of its
nodes.
The question then is how to escalate this to us, and what we do about it.
In the replication case, we need to tell one side to continue and the
other side to die until replication has been restored (or else risk
losing transactions). In the OCFS2 case, we likewise need to elect one
side to continue and stop the other.
We've discussed several schemes in the past.
At BrainShare, Alan brought up a very interesting one - a resource not
able to talk to some other node "blacklists" that node at the
CCM/heartbeat level. The internal split-brain is then escalated to a
global split-brain, our regular arbitration & quorum mechanism kicks in,
and all will resolve itself correctly.
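To make the escalation concrete, here is a minimal Python sketch. It is
not heartbeat or CCM code; it assumes a hypothetical model in which the
membership is a list of node names, each blacklist entry is an unordered
pair of nodes that must not end up in the same partition, and quorum
means a strict majority of all nodes. The function name and its
brute-force search are illustrative only.

```python
from itertools import combinations

def surviving_partition(nodes, blacklist):
    """Return the largest set of nodes containing no blacklisted pair,
    or None if that set lacks a strict majority of the cluster.

    nodes:     iterable of node names (hypothetical membership model)
    blacklist: iterable of (a, b) pairs reported as unable to talk
    """
    nodes = sorted(nodes)
    banned = {frozenset(pair) for pair in blacklist}
    # Try the largest candidate partitions first; brute force is fine
    # for the handful of nodes in a small cluster.
    for size in range(len(nodes), 0, -1):
        for candidate in combinations(nodes, size):
            pairs = combinations(candidate, 2)
            if not any(frozenset(p) in banned for p in pairs):
                # Ties between equally large candidates are broken by
                # name order only, not by which side is "better".
                return set(candidate) if 2 * size > len(nodes) else None
    return None
```

With nodes a, b, c and a ban between a and b, this returns {a, c}; with
only two nodes and a ban between them it returns None, i.e. nobody has
quorum without a tie-breaker.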
Actually, after having thought about this for some time, most of my
initial gut reaction went away - yes, this would work correctly, with
some tweaks. I still have some concerns:
The downside is that indeed it would affect _everything_ in the cluster,
even unrelated services. It also won't always choose the best path to
continue - in the case of drbd, one would want the active side to
continue operating and blacklist the secondary, I think, which this
scheme doesn't guarantee. Still, that could probably be fine-tuned.
It would also cause rather brutal recovery (STONITH and all that), even
though in theory we'd still be able to initiate a proper, clean stop,
i.e. call a to-be-defined "fence" operation on the side we deem to be
the loser. That would also be much easier on the other services.
Also, RAs are commonly polled only every 10-30 seconds, while heartbeats
happen every second or so. I'm not sure whether this can be done without
hysteresis or a barrier (all instances asked to re-verify). Otherwise we
run too high a risk of electing the wrong side to continue, simply
because the one failing node noticed the problem first, and then the
"real" majority loses...
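One way to picture the barrier is the following rough Python sketch. It
assumes a hypothetical coordinator that, after the first complaint, asks
every instance to re-verify and collects one report per node; the
function name and the shape of the reports are made up for illustration.
Nothing is decided until every expected node has answered.

```python
def confirmed_failures(reports, expected):
    """Merge re-verification reports once the barrier is complete.

    reports:  dict mapping node name -> set of peers it cannot reach
              (hypothetical result of asking each instance to re-verify)
    expected: set of node names that must answer before we decide

    Returns None while some expected node has not reported yet,
    otherwise the set of unordered node pairs reported as broken.
    """
    if not set(expected).issubset(reports):
        return None  # barrier not complete; do not act on partial data
    return {
        frozenset((node, peer))
        for node, peers in reports.items()
        for peer in peers
    }
```

A real implementation would also need a timeout for nodes that never
answer; that part is deliberately left out here.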
I'm not quite clear yet on how to lift the ban automatically, either.
Because we've essentially blocked the node at a fairly low level (at
least the CCM level), we can't actually communicate with it after it has
rebooted, and thus can't ever start the resources there in a mode in
which the others could try talking to it again. Hrm. Maybe this is a
fundamental problem and requires admin intervention...
The upside is that it is a whole lot better than what we do right now,
or so I think, and reasonably simple to implement; the logic for
handling split-brains and computing the winning partition is already in
the CCM. All we'd need is a simple way for the RA to feed this
information back to the CCM - the monitor operation could return the
list of unreachable nodes on stdout or so...
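As a strawman for the feedback path, here is a small Python sketch of
the receiving side. The output format is purely an assumption
(whitespace-separated node names on stdout, nothing else), and
parse_unreachable is a hypothetical helper, not an existing heartbeat or
CCM function. Names that are not in the current membership are dropped
rather than trusted.

```python
def parse_unreachable(stdout_text, known_nodes):
    """Extract unreachable node names from a monitor's stdout.

    Assumed format: whitespace-separated node names, on one or several
    lines. Unknown names are ignored so that stray output cannot
    blacklist something that is not a cluster member.
    """
    known = set(known_nodes)
    reported = stdout_text.split()
    # Deduplicate and sort for a stable result.
    return sorted({name for name in reported if name in known})
```

For example, parse_unreachable("node2\nnode3 bogus\n", ["node1",
"node2", "node3"]) yields ["node2", "node3"], and empty output yields an
empty list.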
Comments please?
Sincerely,
Lars Marowsky-Brée
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
    -- Charles Darwin
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/