Hi Thorsten and all, I would like to come back to this topic once again.
First of all, it seems like one open question is answered:

>> root@pub2-node1:~# time metaset -s pub2-node0 -d -f -h pub2-node0
>> Proxy command to: pub2-node0
>> 172.16.4.1: RPC: Rpcbind failure - RPC: Timed out
>> rpc failure
>> real 1m0.110s
>> user 0m0.068s
>> sys 0m0.026s
>
> The RPC timeout is to be expected. What I find strange is that it prints
> 172.16.4.1 - since I would expect "pub2-node0" at this point.

The tests I ran suggest that the name resolution issue at this point can be
avoided by appending [SUCCESS=return] after "cluster files" in the "hosts:"
line of /etc/nsswitch.conf, as is recommended for HA-NFS. Since I added that
option, error messages have always shown resolved host names. This suggests
that adding the above to nsswitch.conf should perhaps be made a general
recommendation for SVM-based clusters, not just for HA-NFS.

Second issue:

The good news is that for a configuration *without* mediators, removing a
node still works fine, as reported in my last posting (it takes 7m15s, which
I think leaves room for improvement, but it works).

The bad news is that, unfortunately, I have repeated the tests with
mediators without success (or at least without success within a reasonable
timeframe). I did two tests in which "metaset -s set -df -h node" did not
return within 30 and 20 minutes, respectively.

I would appreciate it if anyone else could run the following test:

- create a two-string, two-node metaset with mediators in a two-node
  cluster (see the command sketch at the end of this mail)
- make sure /etc/nsswitch.conf contains

  hosts: cluster files [SUCCESS=return]  ## optionally other nameservices

- make sure the set is working (switching between nodes works)
- boot one node into non-cluster mode (boot -x), leaving all network
  interfaces enabled (so that it still has connectivity to the surviving
  node over the public net)
- make sure the diskset is taken on the surviving node
- run "time metaset -s set -df -h node" on the surviving node

Please let me know the result (a script of the session would be ideal).

Interestingly, this is not reproducible (in other words, removal works) if
the node to be removed is down:

pub0-node1:/# ping pub0-node0
no answer from pub0-node0
pub0-node1:/# date ; time metaset -s shared-dg -df -h pub0-node0 ; date
Sat Oct 17 17:39:21 CEST 2009
Oct 17 17:39:43 pub0-node1 genunix: NOTICE: clcomm: Path pub0-node1:xnf3 - pub0-node0:xnf3 errors during initiation
Oct 17 17:39:43 pub0-node1 genunix: WARNING: Path pub0-node1:xnf3 - pub0-node0:xnf3 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
Oct 17 17:39:43 pub0-node1 genunix: NOTICE: clcomm: Path pub0-node1:xnf2 - pub0-node0:xnf2 errors during initiation
Oct 17 17:39:43 pub0-node1 genunix: WARNING: Path pub0-node1:xnf2 - pub0-node0:xnf2 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
Oct 17 17:44:10 pub0-node1 genunix: NOTICE: CMM: Node pub0-node0 (nodeid = 1) is dead.
Oct 17 17:44:10 pub0-node1 genunix: NOTICE: CMM: Quorum device /dev/did/rdsk/d4s2: owner set to node 2.
Oct 17 17:44:32 pub0-node1 md: WARNING: rpc.metamedd on host pub0-node0 not responding
pub0-node0: RPC: Rpcbind failure - RPC: Timed out
Oct 17 17:44:32 pub0-node1 last message repeated 1 time

real    8m16.808s
user    0m0.259s
sys     0m0.159s
Sat Oct 17 17:47:38 CEST 2009

It looks like the various timeouts involved in the process could be
improved.

As always, I will very much appreciate any help, pointers and advice.

Thank you,
Nils
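
P.S.: For anyone who would like to run the test but has not set up mediators
before, the setup should look roughly like this (the set name, host names and
DID devices are just examples; adjust them to your configuration):

  # create the set with both cluster nodes as hosts
  metaset -s testset -a -h pub0-node0 pub0-node1
  # add both nodes as mediator hosts
  metaset -s testset -a -m pub0-node0 pub0-node1
  # add one drive from each string
  metaset -s testset -a /dev/did/rdsk/d5 /dev/did/rdsk/d6

"medstat -s testset" can then be used to check that both mediators are
listed and healthy before starting the test.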