On 4/7/06, Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> this is to move forward a discussion Alan, Andrew, Lars (Ellenberg) and
> myself have had in the past numerous times.
>
> For those not yet in the know ;-), a RISB (hey, one can actually
> pronounce that, it must be a good acronym!) occurs when heartbeat can
> communicate with a different set of nodes than a resource active on
> several nodes. This can essentially occur for anything not using our
> communication channels itself.
>
> For example, a replication mechanism like drbd might lose its internal
> connection - causing the replication to stop -, while heartbeat still
> continues. Or an OCFS2 instance might be unable to reach some of its
> nodes.
>
> The question then is how to escalate this to us, and then what we do
> about that - in case of the replication, we need to tell one side to
> continue and the other side to die until replication has been restored
> (or else risk losing transactions), and in case of OCFS2, also elect one
> side to continue and stop the other one.
>
> We've discussed several schemes in the past.
>
> At BrainShare, Alan brought up a very interesting one - a resource not
> able to talk to some other node "blacklists" that node at the
> CCM/heartbeat level. The internal split-brain is then escalated to a
> global split-brain, our regular arbitration & quorum mechanism kicks in,
> and all will resolve itself correctly.
>
> Actually, after having thought about this for some time, most of my
> initial gut reaction went away - yes, this would work correctly, with
> some tweaks. I still have some concerns:
>
> The downside is that indeed it would affect _everything_ in the cluster,
> even unrelated services. It also won't always choose the best path to
> continue - in the case of drbd, one would want the active side to
> continue operating and blacklist the secondary, I think, which this
> scheme doesn't guarantee. Alas, that could probably be finetuned with
> some tweaks.
>
> It would also cause rather brutal recovery - STONITH and all that -,
> while we'd in theory still be able to initiate a proper and clean stop.
> (ie, call a to-be-defined "fence" operation on the side which we deem to
> be the loser.) Which would also be much easier on the other services.
>
> Also, RAs are commonly pinged every 10-30 seconds only, while heartbeats
> happen every second or so - I'm not sure whether this can be done w/o
> hysteresis / a barrier (all instances asked to re-verify), or else run a
> "too high" risk that we elect the wrong side to continue, simply because
> the one failing node has noticed the problem first, and then the "real"
> majority loses...
>
> I'm not quite clear yet as to how to lift the ban automatically
> either. Because we've essentially blocked it at a fairly low-level (at
> least the CCM level), we can't actually communicate with the node after
> it has rebooted, and thus won't ever start the resources in a mode that
> the others could try talking to it again. Hrm. Maybe this is a
> fundamental problem and requires admin intervention...
>
>
> The upside is that it is a whole lot better than what we do right now,
> or so I think, and reasonably simple to implement; the logic for
> handling split-brains and computing the winning partition is already in
> the CCM. All we'd need is a simple way for the RA to feed this
> information back to the CCM - the monitor operation could return the
> list of unreachable nodes on stdout or so...
>
>
> Comments please?
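
To make the proposal concrete, here is a rough Python sketch of how
"unreachable node" lists from monitor operations could be turned into a
blacklist with a simple hysteresis, followed by a majority check. All of
it is an assumption for illustration: BlacklistTracker, report(), the
confirmations threshold and the strict-majority rule are made-up names
and choices, not anything that exists in heartbeat or the CCM.

```python
from collections import defaultdict


class BlacklistTracker:
    """Hypothetical collector for 'unreachable peer' reports from monitor runs."""

    def __init__(self, nodes, confirmations=2):
        self.nodes = set(nodes)
        # Assumed hysteresis: a link counts as broken only after this many
        # consecutive monitor results from the same reporter say so.
        self.confirmations = confirmations
        self.streak = defaultdict(int)  # (reporter, peer) -> consecutive reports

    def report(self, reporter, unreachable):
        """Record one monitor result: the peers `reporter` could not reach."""
        unreachable = set(unreachable)
        for peer in self.nodes - {reporter}:
            key = (reporter, peer)
            # A peer that is reachable again resets its streak to zero.
            self.streak[key] = self.streak[key] + 1 if peer in unreachable else 0

    def broken_links(self):
        """Node pairs reported unreachable often enough to be believed."""
        return {frozenset(k) for k, n in self.streak.items()
                if n >= self.confirmations}

    def partitions(self):
        """Connected components of the full membership minus broken links."""
        broken = self.broken_links()
        remaining, parts = set(self.nodes), []
        while remaining:
            start = remaining.pop()
            part, todo = {start}, [start]
            while todo:
                cur = todo.pop()
                for other in list(remaining):
                    if frozenset((cur, other)) not in broken:
                        remaining.discard(other)
                        part.add(other)
                        todo.append(other)
            parts.append(part)
        return parts

    def winner(self):
        """Partition holding a strict majority of all nodes, else None."""
        for part in self.partitions():
            if 2 * len(part) > len(self.nodes):
                return part
        return None
```

With three nodes and confirmations=2, a single report from one node
changes nothing; a second identical report splits that node off and the
other two win. In a two-node cluster winner() returns None, so the drbd
case would still need a tie-breaker such as "the active side continues".
Note also that a node which loses only some of its peers stays connected
through the others in this sketch, so partial failures are not split.
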

The recovery seemed a bit harsh; however, we already move all healthy
resources away before fencing a node, so it isn't so horrible.

I assume the idea is that until un-banned, the node can't join the
cluster? At least that prevents shooting the node over and over again.
For this reason I think un-banning needs to be a manual process.

We might need to be careful that a bunch of banned nodes aren't able to
form a cluster of their own and start managing resources. Which leads
into the question of how this will work with the various
no_quorum_policy settings.

I think you have a point about needing a hysteresis... that should be
interesting.

Random thought: maybe attrd (part of the ipfail replacement) could be
useful here.

All in all, I think this is a promising approach.
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
