Hi Thorsten and all,

I would like to come back to this topic once again.

First of all, it seems like one open question is answered:

>> root at pub2-node1:~# time metaset -s pub2-node0 -d -f -h pub2-node0
>> Proxy command to: pub2-node0
>> 172.16.4.1: RPC: Rpcbind failure - RPC: Timed out
>> rpc failure
>> real    1m0.110s
>> user    0m0.068s
>> sys     0m0.026s
> 
> The RPC timeout is to be expected. What I find strange is that it prints 
> 172.16.4.1 - since I would expect "pub2-node0" at this point.

The tests I ran suggest that the name resolution issue at this point can be 
avoided by inserting

        [SUCCESS=return]

after "cluster files" in the "hosts:" line of /etc/nsswitch.conf, as is 
already recommended for HA-NFS. Since adding the option, I have consistently 
seen resolved host names in error messages.
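
For reference, the complete line on my systems now reads as follows (the 
trailing name services are site-specific; dns here is just an example):

        hosts:  cluster files [SUCCESS=return] dns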

This suggests that adding the above to nsswitch.conf should perhaps be made a 
general recommendation for SVM-based clusters, not just for HA-NFS.

Second issue:

The good news is that for a configuration *without* mediators, removing a node 
still works fine, as reported in my last posting (it takes 7m15s, which I 
think leaves room for improvement, but it works).

The bad news is that, unfortunately, I have repeated the tests with mediators 
without success (or at least without success within a reasonable timeframe). 
In two tests,

        metaset -s set -df -h node

did not return within 30 and 20 minutes, respectively.
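
In case it is useful for diagnosis: while the command hangs, its state can be 
inspected from a second shell, for example (just a sketch; pgrep -x matches 
the exact command name):

        pstack $(pgrep -x metaset)    # user-level stack of the hung process
        truss -p $(pgrep -x metaset)  # attach and watch its pending syscalls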

I would appreciate it if anyone else could run the following test:

- create a two-string two-node metaset with mediators in a two-node cluster

- make sure /etc/nsswitch.conf contains

        hosts:  cluster files [SUCCESS=return] ## optionally other nameservices

- make sure the set is working (switching between nodes works)

- boot one node into non-cluster mode (boot -x), leaving all network 
interfaces enabled so that it still has connectivity to the surviving node 
over the public net

- make sure the diskset is taken on the surviving node

- run

        time metaset -s set -df -h node

   on the surviving node, and please let me know the result (a capture of the 
session with script(1) would be ideal; a sample session is sketched below).
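
For concreteness, here is roughly what I have in mind. All names below (the 
set name, node0, node1, the DID devices) are placeholders; adjust the drives 
to your two strings:

        script /var/tmp/metaset-test.txt   # capture the whole session

        # one-time setup, from one node:
        metaset -s set -a -h node0 node1   # create the set with both hosts
        metaset -s set -a -m node0         # add the first mediator host
        metaset -s set -a -m node1         # add the second mediator host
        metaset -s set -a /dev/did/rdsk/d5 /dev/did/rdsk/d6

        # after booting node0 with 'boot -x', on the surviving node1:
        metaset -s set -t                  # take the set if node1 does not own it yet
        medstat -s set                     # mediator status before the removal
        time metaset -s set -df -h node0   # the step that hangs for me

        exit                               # ends script(1), saving the log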

Interestingly, this is not reproducible (in other words, removal works) if the 
node to be removed is down:

pub0-node1:/# ping pub0-node0
no answer from pub0-node0
pub0-node1:/# date ; time metaset -s shared-dg -df -h pub0-node0 ; date
Sat Oct 17 17:39:21 CEST 2009
Oct 17 17:39:43 pub0-node1 genunix: NOTICE: clcomm: Path pub0-node1:xnf3 - pub0-node0:xnf3 errors during initiation
Oct 17 17:39:43 pub0-node1 genunix: WARNING: Path pub0-node1:xnf3 - pub0-node0:xnf3 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
Oct 17 17:39:43 pub0-node1 genunix: NOTICE: clcomm: Path pub0-node1:xnf2 - pub0-node0:xnf2 errors during initiation
Oct 17 17:39:43 pub0-node1 genunix: WARNING: Path pub0-node1:xnf2 - pub0-node0:xnf2 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
Oct 17 17:44:10 pub0-node1 genunix: NOTICE: CMM: Node pub0-node0 (nodeid = 1) is dead.
Oct 17 17:44:10 pub0-node1 genunix: NOTICE: CMM: Quorum device /dev/did/rdsk/d4s2: owner set to node 2.
Oct 17 17:44:32 pub0-node1 md: WARNING: rpc.metamedd on host pub0-node0 not responding
pub0-node0: RPC: Rpcbind failure - RPC: Timed out
Oct 17 17:44:32 pub0-node1 last message repeated 1 time

real    8m16.808s
user    0m0.259s
sys     0m0.159s
Sat Oct 17 17:47:38 CEST 2009

It looks like the various timeouts involved in the process could be shortened: 
of the 8m16s above, almost five minutes pass before CMM declares the node dead 
(17:39:21 to 17:44:10), and much of the rest is spent waiting out RPC timeouts.

As always, I'd very much appreciate any help, pointers and advice.

Thank you, Nils
