Hi Simon, yes, I ran

    mmsdrrestore -p <working node in the cluster>

and that helped to create the /var/mmfs/ccr directory, which was missing. But it didn't create a ccr.nodes file, so I ended up scp'ing that over by hand, which I hope was the right thing to do. The one host that is no longer in service is still in that ccr.nodes file, and when I try to mmdelnode it I get:

[root@ocio-gpu03 renata]# mmdelnode -N dhcp-os-129-164.slac.stanford.edu
mmdelnode: Unable to obtain the GPFS configuration file lock.
mmdelnode: GPFS was unable to obtain a lock from node dhcp-os-129-164.slac.stanford.edu.
mmdelnode: Command failed. Examine previous error messages to determine cause.

despite the fact that it doesn't respond to ping. The mmstartup on the newly reinstalled node fails as in my initial email.

I should mention that the two "working" nodes are running 4.2.3.4. The person who reinstalled the node that won't start up put on 4.2.3.8. I didn't think that was the cause of this problem, though, and thought I would try to get the cluster talking again before upgrading the rest of the nodes or downgrading the reinstalled one.

Thanks,

Renata

On Wed, 27 Jun 2018, Simon Thompson wrote:

>Have you tried running mmsdrrestore on the reinstalled node to readd it to the
>cluster and then try to start up gpfs on it?
>
>https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1pdg_mmsdrrest.htm
>
>Simon
>________________________________________
>From: gpfsug-discuss-boun...@spectrumscale.org
>[gpfsug-discuss-boun...@spectrumscale.org] on behalf of Renata Maria Dart
>[ren...@slac.stanford.edu]
>Sent: 27 June 2018 19:09
>To: gpfsug-discuss@spectrumscale.org
>Subject: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
>
>Hi, we have a client cluster of 4 nodes with 3 quorum nodes. One of the
>quorum nodes is no longer in service and the other was reinstalled with
>a newer OS, both without informing the gpfs admins.
>Gpfs is still "working" on the two remaining nodes, that is, they continue
>to have access to the gpfs data on the remote clusters. But I can no longer
>get any gpfs commands to work. On one of the 2 nodes that are still serving
>data:
>
>[root@ocio-gpu01 ~]# mmlscluster
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
>mmlscluster: Command failed. Examine previous error messages to determine cause.
>
>On the reinstalled node, this fails in the same way:
>
>[root@ocio-gpu02 ccr]# mmstartup
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
>mmstartup: Command failed. Examine previous error messages to determine cause.
>
>I have looked through the users group interchanges but didn't find anything
>that seems to fit this scenario.
>
>Is there a way to salvage this cluster? Can it be done without
>shutting gpfs down on the 2 nodes that continue to work?
>
>Thanks for any advice,
>
>Renata Dart
>SLAC National Accelerator Lab
>
>_______________________________________________
>gpfsug-discuss mailing list
>gpfsug-discuss at spectrumscale.org
>http://gpfsug.org/mailman/listinfo/gpfsug-discuss
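
For readers landing on this thread later: with two of three quorum nodes gone, CCR cannot reach quorum, which is why every mm-command (including mmdelnode) fails to get the configuration lock. The usual recovery path is to shut GPFS down, temporarily fall back from CCR to the server-based repository, remove the dead node, restore the reinstalled node, and re-enable CCR. The sketch below only prints the commands (a dry run); the node names are the ones from this thread, and the exact options should be checked against the mmchcluster, mmdelnode, and mmsdrrestore documentation for your Scale level before running anything for real.

```shell
# Dry-run sketch of the CCR-quorum recovery sequence discussed above.
# Node names are taken from the thread; substitute your own.

GOOD_NODE="ocio-gpu01"                          # healthy node still serving data
DEAD_NODE="dhcp-os-129-164.slac.stanford.edu"   # quorum node permanently out of service
NEW_NODE="ocio-gpu02"                           # reinstalled node that won't start

ccr_recovery_plan() {
    # 1. GPFS must be down cluster-wide before switching repository modes.
    echo "mmshutdown -a"
    # 2. Fall back to the primary-server configuration repository so that
    #    mm-commands no longer need CCR quorum.
    echo "mmchcluster --ccr-disable -p $GOOD_NODE"
    # 3. Remove the node that is permanently gone.
    echo "mmdelnode -N $DEAD_NODE"
    # 4. Rebuild /var/mmfs configuration on the reinstalled node from a
    #    healthy one, then re-enable CCR and bring GPFS back up.
    echo "mmsdrrestore -p $GOOD_NODE -R /usr/bin/scp -N $NEW_NODE"
    echo "mmchcluster --ccr-enable"
    echo "mmstartup -a"
}

# Print the plan for review; nothing is executed against the cluster.
ccr_recovery_plan
```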