[ha-clusters-discuss] how to solve this "amnesia" protect problem

roush Mon, 18 May 2009 08:44:35 -0700

Hi Lifeng,

You have described an "amnesia" situation.

The Sun Cluster design ensures that the operational cluster
always has the latest cluster configuration information.
When we cannot guarantee that the cluster nodes attempting
to form a cluster have the latest cluster configuration information,
we make the nodes wait until a node with the latest information
becomes available.

The justification for this design decision was that data integrity
was the top priority.
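
The waiting behaviour can be pictured with a toy majority-vote model.
This is only a sketch, assuming one vote per node and one vote for the
quorum device; the names has_quorum, configured, and present are
hypothetical and are not part of any Sun Cluster interface.

```python
def has_quorum(configured, present):
    """Return True if the votes held are a strict majority of all configured votes.

    configured: dict mapping voter name -> configured vote count
    present:    set of voter names whose votes the forming partition holds
    """
    total = sum(configured.values())
    held = sum(configured[name] for name in present)
    return held * 2 > total


# Assumed layout: two nodes plus one quorum device, one vote each.
configured = {"nodeA": 1, "nodeB": 1, "quorum_device": 1}

# nodeB kept running after nodeA left, so it held the quorum device vote.
nodeb_with_device = has_quorum(configured, {"nodeB", "quorum_device"})

# nodeA boots alone and cannot claim the quorum device vote.
nodea_alone = has_quorum(configured, {"nodeA"})
```

In this model nodeA alone holds 1 of 3 votes, which is not a majority,
so it waits rather than start from configuration data that may be stale.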

-------------------

The situation that you describe has been raised in the past as a possibility.
If you have actually seen this scenario happen, please file a bug report.
We have a long list of features that we could work on and only
limited engineering resources. Bug reports provide important information about
how important a feature is to our customers.

There is no supported option or workaround at the moment.

In an emergency one could manually adjust the quorum votes of a node
so that it can form a cluster without the other node and also without
the quorum device. Unless you are an expert on the internal workings
of the membership algorithm, I do not recommend that you attempt this
yourself. Be aware that any such manual intervention will require
a second manual intervention to put the cluster back together again
once the other machine is repaired.
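
To see why such an adjustment lets a lone node proceed, and why it has
to be undone afterwards, here is the vote arithmetic only. It is a
sketch under stated assumptions: the vote values and the helper
majority_held are hypothetical, and this does not show how votes are
actually changed on a real cluster.

```python
def majority_held(configured, present):
    """Return True if the present voters hold a strict majority of configured votes."""
    total = sum(configured.values())
    held = sum(configured[name] for name in present)
    return held * 2 > total


# Assumed normal configuration: one vote per node and one for the quorum device.
normal = {"nodeA": 1, "nodeB": 1, "quorum_device": 1}

# Hypothetical emergency configuration: only nodeA's vote counts.
emergency = {"nodeA": 1, "nodeB": 0, "quorum_device": 0}

before = majority_held(normal, {"nodeA"})       # 1 of 3 votes
after = majority_held(emergency, {"nodeA"})     # 1 of 1 votes

# With nodeB repaired but its vote still zero, nodeB adds no protection,
# which is why the votes must be restored in a second manual step.
nodeb_counts = emergency["nodeB"] > 0
```

While the emergency values are in place, the model offers no amnesia
protection at all, which is the reason for the caution above.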

If you hit this scenario, I recommend that you contact your Sun Support
Service representative for help.

Regards,
Ellard

lf yang wrote:
> Hi All
> 
> I have a cluster environment with two nodes and one quorum device. I just saw
> this problem:
> 
> I shut down nodeA and all the resource groups switched to nodeB. It works
> fine, and I believe the quorum device vote belongs to nodeB.
> 
> Then I shut down nodeB and powered on nodeA. nodeA cannot boot into the
> cluster; there are messages like:
> 
> NOTICE: clcomm: Path NodeB:e1000g0 - NodeA:e1000g0 errors during initiation
> WARNING: Path NodeB:e1000g0 - NodeA:e1000g0 initiation encountered errors, 
> errno = 62. Remote node may be down or unreachable through this path.
> NOTICE: CMM: Cluster doesn't have operational quorum yet; waiting for quorum.
> 
> I know this is because the quorum vote is held by nodeB, to prevent the
> amnesia condition. But I am still confused about this: if the problem occurs
> and nodeB is broken and cannot boot at all, is the only thing people can do
> to hack the CCR? Is there any solution, for example, where nodeA can boot as
> cluster master and, after nodeB comes up, the DR can update the older CCR on
> nodeB? Is there any option for this?
> Thanks
> 
> lifeng
