Graham, Simon wrote:
>> On 2007-10-19T21:57:17, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
>>
>>> http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732
>>> for some discussion on communication interfaces.
>>
>> "discussion" means "the current deficits are by design" ;-)
>
> So right now I'm thinking I need to modify the config and restart
> the hb service when changes occur in the NICs...
>
>>>> This seems somewhat counter to the idea of high availability but I'd
>>>> like to understand the design center for this behavior before I start
>>>> trying to 'fix' it...
>>>
>>> What are your circumstances? In which situations should the
>>> interface be down?
>>
>> Hotplug interfaces. Transient issues. Weird bugs.
>>
>> One NIC dead on start-up (which is a valid SPoF scenario which is
>> currently not handled).
>
> Exactly - when you are in a degraded state you still want the cluster to
> come up.
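[For context, the communication interfaces under discussion are the ones declared in heartbeat's ha.cf. Here is a rough illustrative fragment; the directives are real ha.cf directives, but the interface names, address, and node names are invented for the example:]

```
# /etc/ha.d/ha.cf (illustrative fragment; interface names are examples)
bcast eth0              # broadcast heartbeats on eth0
ucast eth1 10.0.0.2     # unicast heartbeats to the peer over eth1
keepalive 2             # seconds between heartbeats
deadtime 30             # seconds of silence before declaring a node dead
node alpha beta         # cluster node names (must match uname -n)
```

If eth0 or eth1 has no underlying hardware at startup, heartbeat refuses to start on that node, which is the behavior being debated below.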
And indeed, the cluster does come up - without a node. A more accurate summary is that "a single node in the cluster doesn't come up". So, the _cluster_ does recover from this error; it just does so without that node, and service is not interrupted.

> The specific case that started me looking at this is when there is no
> address set on a link (e.g. if link is down at startup) which causes
> hb to simply refuse to start.

It's not "link down" - it's "hardware missing". Link down won't keep heartbeat from starting, but missing hardware certainly will. So, to correct both of these errors in the description: when the hardware supporting heartbeat communications is missing at startup, the node on which it's missing will refuse to start, resulting in a degraded but operational cluster. Of course, if you do "recover" from the error, you have the same situation - a degraded but operational cluster, in this case somewhat less degraded than the one above.

Here's why it works that way: it is very common for people to make mistakes in configuration, and it is impossible to distinguish between a mistake and a broken interface. It is also very hard to get people to read logs; failing to start does a good job of getting their attention. Because of those considerations _and_ the complexity of doing otherwise, heartbeat does not put any effort into trying to recover from this case. Such recovery code would be exercised very rarely in practice (perhaps once every 5K-10K cluster-years, judging from past experience), so the chances of it harboring undiscovered bugs would be great. The current behavior instead exercises well-tested recovery paths (what to do when a node is down).

I don't claim that this is a perfect response, but at initial startup you really don't want configuration errors to go unnoticed, and you can't tell which case this is. I would guess that in 99+% of cases it's a misconfiguration rather than a real failure.
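[The fail-fast startup check described above can be sketched roughly as follows. This is a minimal illustration, not heartbeat's actual code; the function names are mine, and probing /sys/class/net is just one plausible way to detect "hardware missing" (as opposed to merely "link down") on Linux:]

```python
import os

def present_interfaces():
    """Interfaces the kernel currently knows about (Linux-specific).

    An interface with no entry under /sys/class/net is "hardware
    missing", which is a different condition from "link down".
    """
    try:
        return set(os.listdir("/sys/class/net"))
    except OSError:
        return set()

def missing_interfaces(configured, present=None):
    """Return the configured interfaces that have no underlying hardware."""
    if present is None:
        present = present_interfaces()
    return [iface for iface in configured if iface not in present]

def startup_check(configured, present=None):
    """Refuse to start, loudly, if any configured interface is absent.

    Failing fast here surfaces the (far more likely) misconfiguration
    case, since a typo'd interface name and a dead NIC look identical.
    """
    missing = missing_interfaces(configured, present)
    if missing:
        raise SystemExit("refusing to start: missing interface(s): "
                         + ", ".join(missing)
                         + " (misconfiguration or dead NIC?)")
```

This keeps the rarely-exercised "recover from a missing interface" path out of the picture entirely; the cluster's ordinary node-down recovery handles the rest.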
The other case - an interface going away after startup - is one the code _probably_ should recover from. It is also worth noting that in practice (as opposed to in testing), this has not come up to my knowledge. The only real-life bug I've heard of which exhibited it was on a system that was quite probably misconfigured (using DHCP for cluster interfaces).

Keep in mind that a cluster will not stop providing service just because a single node doesn't come up. So you haven't lost service when this happens, but you do get some really nasty messages, and "failure to start" usually gets people's attention. I am fully aware that subsequent failures can indeed cause things to fail, but this behavior does not constitute a single point of failure for the cluster.

This is the rationale for this behavior. It's not perfect, but it's not completely irrational either...

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions."
    - William Wilberforce

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/