Hi Nils et al., see inline:
Nils Goroll wrote:
> Hi Thorsten,
>
>>>>> ## node0 booted outside cluster (-x)
>>>> Why are you booting the node out of the cluster?
>>>
>>> I am trying to work out a procedure to restore a failed cluster node
>>> on different hardware, in which case I cannot assume that the
>>> interconnect will come up as the CLI interfaces might have changed.
>>
>> Now I am confused. So let me add some more context and see if this is
>> what you are doing.
>>
>> The starting point is a working two node cluster (let's call them
>> node-a and node-b).
>>
>> A diskset gets configured for both nodes.
>>
>> One node fails and is no longer available. Let's assume this is node-b.
>>
>> You should still be able to boot node-a in cluster mode.
>
> Correct.
>
> > If you then determine node-b to be non-repairable/restorable, you should
> > be able to remove node-b from the diskset by using:
> >
> > root@node-a# metaset -s <disksetname> -df -h node-b
>
> which is exactly what I am trying to do. In the case I posted, the
> failed node is node0 and I am trying to run on node1 (booted in cluster):
>
> root@pub2-node1:~# time metaset -s pub2-node0 -d -f -h pub2-node0
> Proxy command to: pub2-node0
> 172.16.4.1: RPC: Rpcbind failure - RPC: Timed out
> rpc failure
> real    1m0.110s
> user    0m0.068s
> sys     0m0.026s

The RPC timeout is to be expected. What I find strange is that it prints
172.16.4.1 - I would expect "pub2-node0" at this point.

As it happens, I have a cluster in my lab that indeed has a failed node (the
Ultra 10 motherboard died). Granted, that is still an S10 11/06 and SC 3.2 FCS
cluster, but I was able to perform the following (lab-u10-1 is the still
active node, lab-u10-2 is the genuinely dead node):

root@lab-u10-1 # metaset
[...]
Set name = sge_ds, Set number = 2

Host                Owner
  lab-u10-1          Yes
  lab-u10-2

Mediator Host(s)    Aliases
  lab-u10-1
  lab-u10-2

Driv Dbase

d3   Yes
[...]

root@lab-u10-1 # metaset -s sge_ds -df -h lab-u10-2
Oct 14 16:05:05 lab-u10-1 md: WARNING: rpc.metamedd on host lab-u10-2 not responding
lab-u10-2: RPC: Rpcbind failure - RPC: Timed out

root@lab-u10-1 # metaset
[...]
Set name = sge_ds, Set number = 2

Host                Owner
  lab-u10-1          Yes

Mediator Host(s)    Aliases
  lab-u10-1
  lab-u10-2

Driv Dbase

d3   Yes
[...]

So as you can see, the RPC failure is to be expected: the cluster tries to
contact the other node and fails because it is dead, but it ultimately
removes the node from the diskset. Note that the diskset must be online on
the remaining node (here lab-u10-1).

The obvious difference from your output is that mine uses the correct node
name. As I remember, SVM is very picky about that host name - it must be
exactly the name under which the host was registered in the diskset. Your
output shows the host names pub2-node0 and pub2-node1, but the RPC error is
for 172.16.4.1 - I would have expected "pub2-node0" there instead. That IP
looks like the cluster interconnect. Is it possible that your host name / IP
resolution is somehow broken?
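In case it helps, a quick sketch of how that could be checked on pub2-node1
(these just show what the hosts lookup returns; the output from your system
will of course differ):

# what does the hosts lookup return for the diskset host name,
# and what does the interconnect address map back to?
root@pub2-node1:~# getent hosts pub2-node0
root@pub2-node1:~# getent hosts 172.16.4.1

# which sources are consulted for host lookups, and in which order?
root@pub2-node1:~# grep '^hosts' /etc/nsswitch.conf

If 172.16.4.1 does not map back to the expected private hostname, or
pub2-node0 resolves to something unexpected, that might be why metaset
prints the IP instead of the host name.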
> root@pub2-node1:~# metaset
>
> Set name = nfs-dg, Set number = 1
>
> Host                Owner
>   pub2-node0
>   pub2-node1         Yes
>
> Driv Dbase
>
> d1   Yes
>
> d2   Yes
>
> I've tried the same on s10 with sc32 and didn't succeed either.

Was the correct host name displayed on that cluster?

> Regarding one side aspect:
>
>> Thus I am not sure what kind of interconnect or CLI interface issues
>> you expect.
>
> I was referring to the fact that in the recovery scenario I am trying to
> solve it might not be possible to form a cluster, because the failed node
> might be restored on different hardware, so the (restored) cluster config
> would still contain adapters which don't exist on the (changed) hardware.

In that case you should not try to add that node under the same name as the
failed one. Instead, remove the failed one and add the restored hardware as
a completely new node.

>> I would assume that you need to remove the node from other things like
>> resource groups, quorum device, etc., before you actually perform the
>> "clnode clear -F node-b" from node-a (again being in cluster mode).
>>
>> "clnode remove" would only be used if the node you want to remove is
>> still bootable into non-cluster mode.
>
> Please let me come back to this point later, I currently can't access my
> development environment :-(
>
>> Or are you trying to remove a dead node, and then later add a
>> different new node?
>
> This is the plan. For the purpose of the development environment, the
> removed and added node are the same, but this is just a simplification.

As recommended above: unless you have exactly the same hardware and can
restore the exact state of the failed node from a backup, I would remove the
failed node and add the restored one to the cluster as a new node.

Regards

   Thorsten
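PS: for completeness, a rough sketch of the sequence discussed above, run on
the surviving node and using the names from your metaset listing. Treat it as
a sketch rather than a tested procedure - on a two-node cluster the quorum
handling needs extra care, so please check the SC 3.2 documentation for the
exact steps:

# the set must be online/owned on the surviving node - your metaset output
# shows pub2-node1 already is the owner; then drop the dead host from it
root@pub2-node1:~# metaset -s nfs-dg -d -f -h pub2-node0

# see what still references the dead node before clearing it
root@pub2-node1:~# clresourcegroup status
root@pub2-node1:~# clquorum status

# once resource groups and quorum no longer reference pub2-node0
root@pub2-node1:~# clnode clear -F pub2-node0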