On 07/13/09 14:19, Sergei Kolodka wrote:
>> 2. Node 1 panics
>> To node 2, this still looks like a split brain as it can not contact
>> node1. If the algo is modified to give priority to node1, you will have
>> a full cluster outage.
>
> Correct. However if node 2 panics it also loses connection to quorum server.
> I'm pretty sure that we can reliably say that if node is not accessible via
> both private and public interfaces it is 99.9% dead. Given choice between
> 5-15 second delay before failover/crash of node and 15+ minute-long start of
> tens of Oracle databases I'd choose 5-15 second delay.

In my opinion, this should be optional (i.e. something the administrator can
configure), because many people want the decision to be 100% accurate, which
is what the algorithm does now.
In this particular example, the quorum device may not be a Quorum Server (QS);
it could be a disk, and then things become trickier. Also, if it is a QS and
node1 is hung (not panicked), the QS will still report node1 as active.
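
To make the "optional" idea a bit more concrete, here is a rough Python sketch of an administrator-selectable policy. It is not how the cluster software implements quorum; the names Policy, PeerView and peer_presumed_dead are made up for illustration, and letting the quorum device's report veto the shortcut is my own assumption, added to cover the hung-node case.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    # Hypothetical setting names, not real cluster properties.
    STRICT = "strict"        # never presume the peer dead from reachability alone
    PRESUME_DEAD = "presume" # opt-in heuristic discussed in this thread


@dataclass
class PeerView:
    """What the surviving node can observe about its peer."""
    private_reachable: bool   # heartbeat over the private interconnect
    public_reachable: bool    # probe over the public network
    quorum_says_active: bool  # quorum device still reports the peer as active


def peer_presumed_dead(view: PeerView, policy: Policy) -> bool:
    """Return True if the survivor may treat the peer as dead right away."""
    if policy is Policy.STRICT:
        # Accuracy-first choice: always go through the normal quorum arbitration.
        return False
    unreachable = not view.private_reachable and not view.public_reachable
    # Assumption: a quorum device that still sees the peer (e.g. a hung node)
    # blocks the shortcut, since "unreachable" alone does not prove "dead".
    return unreachable and not view.quorum_says_active
```

With PRESUME_DEAD, a peer that is silent on both networks and unknown to the quorum device is treated as dead; a hung peer that the QS still lists as active is not. With STRICT, the function always returns False and the usual arbitration decides.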
