On 4/7/06, Lars Marowsky-Bree <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> this is to move forward a discussion Alan, Andrew, Lars (Ellenberg) and
> myself have had in the past numerous times.
>
> For those not yet in the know ;-), a RISB (hey, one can actually
> pronounce that, it must be a good acronym!) occurs when heartbeat can
> communicate with a different set of nodes than a resource active on
> several nodes. This can essentially occur for anything not using our
> communication channels itself.
>
> For example, a replication mechanism like drbd might lose its internal
> connection - causing the replication to stop -, while heartbeat still
> continues. Or an OCFS2 instance might be unable to reach some of its
> nodes.
>
> The question then is how to escalate this to us, and then what we do
> about that - in case of the replication, we need to tell one side to
> continue and the other side to die until replication has been restored
> (or else risk losing transactions), and in case of OCFS2, also elect one
> side to continue and stop the other one.
>
> We've discussed several schemes in the past.
>
> At BrainShare, Alan brought up a very interesting one - a resource not
> able to talk to some other node "blacklists" that node at the
> CCM/heartbeat level. The internal split-brain is then escalated to a
> global split-brain, our regular arbitration & quorum mechanism kicks in,
> and all will resolve itself correctly.
>
> Actually, after having thought about this for some time, most of my
> initial gut reaction went away - yes, this would work correctly, with
> some tweaks. I still have some concerns:
>
> The downside is that indeed it would affect _everything_ in the cluster,
> even unrelated services. It also won't always choose the best path to
> continue - in the case of drbd, one would want the active side to
> continue operating and blacklist the secondary, I think, which this
> scheme doesn't guarantee. Alas, that could probably be fine-tuned with
> some tweaks.
>
> It would also cause rather brutal recovery - STONITH and all that -,
> while we'd in theory still be able to initiate a proper and clean stop.
> (ie, call a to-be-defined "fence" operation on the side which we deem to
> be the loser.) Which would also be much easier on the other services.
>
> Also, RAs are commonly pinged every 10-30 seconds only, while heartbeats
> happen every second or so - I'm not sure whether this can be done w/o
> hysteresis / a barrier (all instances asked to re-verify), or else run a
> "too high" risk that we elect the wrong side to continue, simply because
> the one failing node has noticed the problem first, and then the "real"
> majority loses...
>
> I'm not quite clear yet as to how to lift the ban automatically
> either. Because we've essentially blocked it at a fairly low-level (at
> least the CCM level), we can't actually communicate with the node after
> it has rebooted, and thus won't ever start the resources in a mode that
> the others could try talking to it again. Hrm. Maybe this is a
> fundamental problem and requires admin intervention...
>
>
> The upside is that it is a whole lot better than what we do right now,
> or so I think, and reasonably simple to implement; the logic for
> handling split-brains and computing the winning partition is already in
> the CCM. All we'd need is a simple way for the RA to feed this
> information back to the CCM - the monitor operation could return the
> list of unreachable nodes on stdout or so...
>
>
> Comments please?
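To make the feedback path at the end of the quoted mail a bit more
concrete, here is a rough sketch of how the monitor output could be
turned into a list of unreachable nodes. Everything in it is an
assumption for illustration: the "unreachable:" line format and the
function name parse_unreachable() are made up, and nothing like this
exists in the RA API today.

```python
def parse_unreachable(monitor_stdout, known_nodes):
    """Return the set of cluster nodes a monitor run reported as unreachable.

    Hypothetical convention: the RA prints a line such as
    "unreachable: node2 node3" on stdout. All other lines are ignored.
    """
    reported = set()
    for line in monitor_stdout.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "unreachable":
            reported.update(value.split())
    # Drop names that are not cluster members, so a confused RA
    # cannot blacklist something we have never heard of.
    return reported & set(known_nodes)


if __name__ == "__main__":
    out = "status ok\nunreachable: node2 node9\n"
    print(parse_unreachable(out, ["node1", "node2", "node3"]))
```

Filtering against the known membership is a deliberate choice here:
whatever the RA prints, only real cluster nodes can end up on the list
that is handed on.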

The recovery seemed a bit harsh at first; however, we already move all
healthy resources away before fencing a node, so it isn't so horrible.

I assume the idea is that until un-banned, the node can't join the cluster?
At least that prevents shooting the node over and over again.
For this reason I think un-banning needs to be a manual process.

We might need to be careful that a bunch of banned nodes aren't able
to form a cluster of their own and start managing resources.

Which leads into the question of how this will work with the various
no_quorum_policies.
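A small sketch of why the policy matters for a group of banned nodes.
This is not the CCM or CRM logic, just a simplified model: quorum is
assumed to be a strict majority of the configured nodes, the policy
values are assumed to be "stop", "freeze" and "ignore", and the
function names are invented for the example.

```python
def has_quorum(partition, total_nodes):
    """Strict majority of all configured nodes (simplifying assumption)."""
    return len(partition) * 2 > total_nodes


def may_start_resources(partition, total_nodes, no_quorum_policy):
    """Would this partition be allowed to start resources?

    Assumed policy semantics for the sketch:
      "stop"   - without quorum, stop everything
      "freeze" - without quorum, keep what is running, start nothing new
      "ignore" - carry on as if we had quorum
    """
    if has_quorum(partition, total_nodes):
        return True
    return no_quorum_policy == "ignore"


if __name__ == "__main__":
    banned = {"node4", "node5"}  # two banned nodes that can see each other
    for policy in ("stop", "freeze", "ignore"):
        print(policy, may_start_resources(banned, 5, policy))
```

In this model the two banned nodes stay harmless under "stop" and
"freeze", but with "ignore" they would happily start managing
resources, which is exactly the case we need to be careful about.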

I think you have a point about needing a hysteresis... that should be
interesting.
Random thought, maybe attrd (part of the ipfail replacement) could be
useful here.
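One simple form of hysteresis, sketched under assumptions: a peer is
only escalated once the same reporter has seen it unreachable in N
consecutive monitor runs. The class name, the threshold of 3 and the
update() interface are all invented for illustration, and this says
nothing about how attrd would actually be wired in.

```python
from collections import defaultdict


class UnreachableFilter:
    """Escalate a peer only after `threshold` consecutive bad reports."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        # (reporter, peer) -> number of consecutive "unreachable" reports
        self.streak = defaultdict(int)

    def update(self, reporter, unreachable):
        """Record one monitor result; return the peers to escalate now."""
        unreachable = set(unreachable)
        # A peer missing from this report is reachable again: reset its streak.
        stale = [k for k in self.streak
                 if k[0] == reporter and k[1] not in unreachable]
        for key in stale:
            del self.streak[key]
        confirmed = set()
        for peer in unreachable:
            self.streak[(reporter, peer)] += 1
            if self.streak[(reporter, peer)] >= self.threshold:
                confirmed.add(peer)
        return confirmed


if __name__ == "__main__":
    f = UnreachableFilter(threshold=3)
    for _ in range(3):
        print(f.update("node1", {"node2"}))
```

With monitors running every 10-30 seconds, a threshold like this
trades detection time for fewer wrong elections. It does not replace
the barrier idea (asking all instances to re-verify); it only damps
one-off glitches.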


All in all, I think this is a promising approach.
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/