Hi All,

testnsd1 and testnsd3 both had hardware issues (power supply and internal HD 
respectively).  Given that they were 12 year old boxes, we decided to replace 
them with other boxes that are a mere 7 years old … keep in mind that this is a 
test cluster.

Disabling CCR does not work, even with the undocumented “--force” option:

/var/mmfs/gen
root@testnsd2# mmchcluster --ccr-disable -p testnsd2 -s testnsd1 --force
mmchcluster: Unable to obtain the GPFS configuration file lock.
mmchcluster: GPFS was unable to obtain a lock from node testnsd1.vampire.
mmchcluster: Processing continues without lock protection.
The authenticity of host 'testnsd3.vampire (10.0.6.215)' can't be established.
ECDSA key fingerprint is SHA256:Ky1pkjsC/kvt4RA8PJuEh/W3vcxCJZplr2m1XHr+UwI.
ECDSA key fingerprint is MD5:55:59:a0:2a:6e:a1:00:58:85:3d:ac:86:0e:cd:2a:8a.
Are you sure you want to continue connecting (yes/no)? The authenticity of host 
'testnsd1.vampire (10.0.6.213)' can't be established.
ECDSA key fingerprint is SHA256:WPiTtyuyzhuv+lRRpgDjLuHpyHyk/W3+c5N9SabWvnE.
ECDSA key fingerprint is MD5:26:26:2a:bf:e4:cb:1d:a8:27:35:96:ef:b5:96:e0:29.
Are you sure you want to continue connecting (yes/no)? The authenticity of host 
'vmp609.vampire (10.0.21.9)' can't be established.
ECDSA key fingerprint is SHA256:/gX6eSp/shsRboVFcUFcNCtGSfbBIWQZ/CWjA6gb17Q.
ECDSA key fingerprint is MD5:ca:4d:58:8c:91:28:25:7b:5b:b1:0d:a3:72:a3:00:bb.
Are you sure you want to continue connecting (yes/no)? The authenticity of host 
'vmp608.vampire (10.0.21.8)' can't be established.
ECDSA key fingerprint is SHA256:tvtNWN9b7/Qknb/Am8x7FzyMngi6R3f5SHBqATNtLzw.
ECDSA key fingerprint is MD5:fc:4e:87:fb:09:82:cd:67:b0:7d:7f:c7:4b:83:b9:6c.
Are you sure you want to continue connecting (yes/no)? The authenticity of host 
'vmp612.vampire (10.0.21.12)' can't be established.
ECDSA key fingerprint is SHA256:zKXqPt8rIMZWSAYavKEuaAVIm31OGVovoWVU+dBTRPM.
ECDSA key fingerprint is MD5:72:4d:fb:22:4e:b3:0e:04:37:be:16:74:ae:ea:05:6c.
Are you sure you want to continue connecting (yes/no)? 
[email protected]'s password: testnsd3.vampire:  
Host key verification failed.
mmdsh: testnsd3.vampire remote shell process had return code 255.
testnsd1.vampire:  Host key verification failed.
mmdsh: testnsd1.vampire remote shell process had return code 255.
vmp609.vampire:  Host key verification failed.
mmdsh: vmp609.vampire remote shell process had return code 255.
vmp608.vampire:  Host key verification failed.
mmdsh: vmp608.vampire remote shell process had return code 255.
vmp612.vampire:  Host key verification failed.
mmdsh: vmp612.vampire remote shell process had return code 255.

[email protected]'s password: vmp610.vampire:  
Permission denied, please try again.

[email protected]'s password: vmp610.vampire:  
Permission denied, please try again.

vmp610.vampire:  Permission denied 
(publickey,gssapi-keyex,gssapi-with-mic,password).
mmdsh: vmp610.vampire remote shell process had return code 255.

Verifying GPFS is stopped on all nodes ...
The authenticity of host 'testnsd3.vampire (10.0.6.215)' can't be established.
ECDSA key fingerprint is SHA256:Ky1pkjsC/kvt4RA8PJuEh/W3vcxCJZplr2m1XHr+UwI.
ECDSA key fingerprint is MD5:55:59:a0:2a:6e:a1:00:58:85:3d:ac:86:0e:cd:2a:8a.
Are you sure you want to continue connecting (yes/no)? The authenticity of host 
'vmp612.vampire (10.0.21.12)' can't be established.
ECDSA key fingerprint is SHA256:zKXqPt8rIMZWSAYavKEuaAVIm31OGVovoWVU+dBTRPM.
ECDSA key fingerprint is MD5:72:4d:fb:22:4e:b3:0e:04:37:be:16:74:ae:ea:05:6c.
Are you sure you want to continue connecting (yes/no)? The authenticity of host 
'vmp608.vampire (10.0.21.8)' can't be established.
ECDSA key fingerprint is SHA256:tvtNWN9b7/Qknb/Am8x7FzyMngi6R3f5SHBqATNtLzw.
ECDSA key fingerprint is MD5:fc:4e:87:fb:09:82:cd:67:b0:7d:7f:c7:4b:83:b9:6c.
Are you sure you want to continue connecting (yes/no)? The authenticity of host 
'vmp609.vampire (10.0.21.9)' can't be established.
ECDSA key fingerprint is SHA256:/gX6eSp/shsRboVFcUFcNCtGSfbBIWQZ/CWjA6gb17Q.
ECDSA key fingerprint is MD5:ca:4d:58:8c:91:28:25:7b:5b:b1:0d:a3:72:a3:00:bb.
Are you sure you want to continue connecting (yes/no)? The authenticity of host 
'testnsd1.vampire (10.0.6.213)' can't be established.
ECDSA key fingerprint is SHA256:WPiTtyuyzhuv+lRRpgDjLuHpyHyk/W3+c5N9SabWvnE.
ECDSA key fingerprint is MD5:26:26:2a:bf:e4:cb:1d:a8:27:35:96:ef:b5:96:e0:29.
Are you sure you want to continue connecting (yes/no)? 
[email protected]'s password:
[email protected]'s password:
[email protected]'s password:

testnsd3.vampire:  Host key verification failed.
mmdsh: testnsd3.vampire remote shell process had return code 255.
vmp612.vampire:  Host key verification failed.
mmdsh: vmp612.vampire remote shell process had return code 255.
vmp608.vampire:  Host key verification failed.
mmdsh: vmp608.vampire remote shell process had return code 255.
vmp609.vampire:  Host key verification failed.
mmdsh: vmp609.vampire remote shell process had return code 255.
testnsd1.vampire:  Host key verification failed.
mmdsh: testnsd1.vampire remote shell process had return code 255.
vmp610.vampire:  Permission denied, please try again.
vmp610.vampire:  Permission denied, please try again.
vmp610.vampire:  Permission denied 
(publickey,gssapi-keyex,gssapi-with-mic,password).
mmdsh: vmp610.vampire remote shell process had return code 255.
mmchcluster: Command failed. Examine previous error messages to determine cause.
/var/mmfs/gen
root@testnsd2#
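
Every mmdsh failure in the transcript above is either a host-key prompt or a password prompt, so passwordless root ssh with current host keys to every node is a prerequisite for any mmchcluster attempt. A minimal sketch for clearing the stale keys left behind by the replaced hardware (the helper name is mine, node names are from the log, and ssh-keyscan's answers should be checked against the fingerprints shown on the console before trusting them):

```shell
# refresh_known_hosts KNOWN_HOSTS_FILE NODE...
# Drop any stale host key for each node, then record its current key
# non-interactively so mmdsh never hits an "Are you sure...?" prompt.
refresh_known_hosts() {
    local kh="$1"; shift
    local node
    for node in "$@"; do
        ssh-keygen -R "$node" -f "$kh" >/dev/null 2>&1     # remove stale entry
        ssh-keyscan -t ecdsa "$node" >> "$kh" 2>/dev/null  # record current key
    done
}

# e.g. refresh_known_hosts ~/.ssh/known_hosts \
#          testnsd1.vampire testnsd3.vampire vmp608.vampire vmp609.vampire vmp612.vampire
```

The vmp610 failures are password prompts rather than key mismatches, so that node would also need root's authorized_keys restored.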

I believe that part of the problem may be that there are 4 client nodes that 
were taken out of service without ever being removed from the cluster 
configuration (done by another SysAdmin who was in a hurry to repurpose those 
machines).  They’re up and pingable but no longer reachable by GPFS, which I’m 
pretty sure is making things worse.
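
For those up-but-unreachable clients, one quick check is whether the GPFS daemon port still answers: 1191 is the registered GPFS daemon port, and a node that pings but refuses 1191 is exactly "pingable but not reachable by GPFS". The helper below is a hypothetical diagnostic, not a GPFS command:

```shell
# check_gpfs_port HOST : print "open" or "closed" for TCP 1191 on HOST.
# A node that answers ping but shows "closed" is visible to the network
# but not to the GPFS daemon.
check_gpfs_port() {
    if timeout 2 bash -c "exec 3<>/dev/tcp/$1/1191" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

# e.g. for node in vmp608 vmp609 vmp610 vmp612; do
#          printf '%s: %s\n' "$node" "$(check_gpfs_port "$node")"
#      done
```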

Nor does Loic’s suggestion of running mmcommon work (but thanks for the 
suggestion!) … actually, the mmcommon part itself worked, but a subsequent 
attempt to start the cluster failed:

/var/mmfs/gen
root@testnsd2# mmstartup -a
get file failed: Not enough CCR quorum nodes available (err 809)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
mmstartup: Command failed. Examine previous error messages to determine cause.
/var/mmfs/gen
root@testnsd2#

Thanks.

Kevin

On Sep 19, 2017, at 10:07 PM, IBM Spectrum Scale 
<[email protected]> wrote:


Hi Kevin,

Let me try to understand the problem you have. What is the meaning of “node 
died” here? Do you mean that there is a hardware/OS issue which cannot be 
fixed, so the OS cannot come up anymore?

I agree with Bob that you can try disabling CCR temporarily, restoring the 
cluster configuration, and enabling it again.

Such as:

1. Log in to a node which has a proper GPFS configuration, e.g. NodeA
2. Shut down the daemon on all nodes in the cluster.
3. mmchcluster --ccr-disable -p NodeA
4. mmsdrrestore -a -p NodeA
5. mmauth genkey propagate -N testnsd1,testnsd3
6. mmchcluster --ccr-enable
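
A rough shell transcription of steps 2-6, written as a dry run so the commands can be reviewed before anything is executed. NodeA is a placeholder per step 1, the DRY_RUN/run convention is purely illustrative (not a GPFS option), and reading step 2 as `mmshutdown -a` is my assumption:

```shell
# Step 1 is simply logging in to NodeA; steps 2-6 follow below.
DRY_RUN=${DRY_RUN:-1}    # leave at 1 to only print the commands
NODEA="NodeA"            # placeholder: the node with a good GPFS config

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"   # dry run: show the command only
    else
        "$@"                   # real run: execute it
    fi
}

run mmshutdown -a                                  # 2. stop the daemon cluster-wide
run mmchcluster --ccr-disable -p "$NODEA"          # 3. fall back to server-based config
run mmsdrrestore -a -p "$NODEA"                    # 4. push NodeA's config everywhere
run mmauth genkey propagate -N testnsd1,testnsd3   # 5. redistribute keys to the rebuilt nodes
run mmchcluster --ccr-enable                       # 6. turn CCR back on
```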

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------
If you feel that your question can benefit other users of Spectrum Scale 
(GPFS), then please post it to the public IBM developerWorks Forum at 
https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.

If your query concerns a potential software error in Spectrum Scale (GPFS) and 
you have an IBM software maintenance contract please contact 1-800-237-5511 in 
the United States or your local IBM Service Center in other countries.

The forum is informally monitored as time permits and should not be used for 
priority messages to the Spectrum Scale (GPFS) team.


From: "Oesterlin, Robert" 
<[email protected]<mailto:[email protected]>>
To: gpfsug main discussion list 
<[email protected]<mailto:[email protected]>>
Date: 09/20/2017 07:39 AM
Subject: Re: [gpfsug-discuss] CCR cluster down for the count?
Sent by: 
[email protected]<mailto:[email protected]>

________________________________



OK – I’ve run across this before, and it’s because of a bug (as I recall) 
having to do with CCR and quorum. What I think you can do is set the cluster to 
non-ccr (mmchcluster --ccr-disable) with all the nodes down, bring it back up 
and then re-enable ccr.

I’ll see if I can find this in one of the recent 4.2 release notes.


Bob Oesterlin
Sr Principal Storage Engineer, Nuance


From: <[email protected]> on behalf of 
"Buterbaugh, Kevin L" <[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Tuesday, September 19, 2017 at 4:03 PM
To: gpfsug main discussion list <[email protected]>
Subject: [EXTERNAL] [gpfsug-discuss] CCR cluster down for the count?

Hi All,

We have a small test cluster that is CCR enabled. It only had/has 3 NSD servers 
(testnsd1, 2, and 3) and maybe 3-6 clients. testnsd3 died a while back. I did 
nothing about it at the time because it was due to be life-cycled as soon as I 
finished a couple of higher priority projects.

Yesterday, testnsd1 also died, which took the whole cluster down. So now 
resolving this has become higher priority… ;-)

I took two other boxes and set them up as testnsd1 and 3, respectively. I’ve 
done a “mmsdrrestore -p testnsd2 -R /usr/bin/scp” on both of them. I’ve also 
done a "mmccr setup -F” and copied the ccr.disks and ccr.nodes files from 
testnsd2 to them. And I’ve copied /var/mmfs/gen/mmsdrfs from testnsd2 to 
testnsd1 and 3. In case it’s not obvious from the above, networking is fine … 
ssh without a password between those 3 boxes is fine.
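
One hedged sanity check before retrying startup: confirm the hand-copied files really are byte-identical on all three quorum nodes. The helper below is my own sketch, and the ccr.* paths are assumptions — adjust them to wherever those files live on your systems:

```shell
# compare_checksums FILE NODE... : print "<node>: <md5>  <file>" per node.
# Identical hashes across nodes mean identical copies of the config file.
compare_checksums() {
    local f="$1"; shift
    local node
    for node in "$@"; do
        if [ "$node" = localhost ]; then
            md5sum "$f" 2>/dev/null | sed "s|^|$node: |"          # local copy
        else
            ssh "$node" "md5sum '$f'" 2>/dev/null | sed "s|^|$node: |"  # remote copy
        fi
    done
}

# e.g. compare_checksums /var/mmfs/gen/mmsdrfs testnsd1 localhost testnsd3
```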

However, when I try to startup GPFS … or run any GPFS command I get:

/root
root@testnsd2# mmstartup -a
get file failed: Not enough CCR quorum nodes available (err 809)
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
mmstartup: Command failed. Examine previous error messages to determine cause.
/root
root@testnsd2#

I’ve got to run to a meeting right now, so I hope I’m not leaving out any 
crucial details here … does anyone have an idea what I need to do? Thanks…

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected] - (615)875-9633


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


