Re: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

2019-01-31 Thread Oesterlin, Robert
A better way to detect node expels is to install the expelnode script in 
/var/mmfs/etc/ (there is a sample in /usr/lpp/mmfs/samples/expelnode.sample) on 
your manager nodes. It runs on every expel, and you can customize it pretty 
easily. We use it to send a Slack message to a specific channel:

GPFS Node Expel nrg1 APP [1:56 AM] nrg1-gpfs01 Expelling node gnj-r05r05u30, 
other node cnt-r04r08u40
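
A rough sketch of how such a notifier might look, not the script actually in use here. It assumes the expelnode script receives the expelled node and the other node as its first two arguments (check expelnode.sample for the real argument list) and that a Slack incoming webhook URL is available in SLACK_WEBHOOK_URL; both the variable and the function names are made up for illustration.

```shell
#!/bin/bash
# Hypothetical expel notifier. The argument order is an assumption;
# consult /usr/lpp/mmfs/samples/expelnode.sample for the real one.

# Build a one-line message in the style of the Slack post above.
format_expel_message() {
    local reporter="$1" expelled="$2" other="$3"
    printf '%s Expelling node %s, other node %s' "$reporter" "$expelled" "$other"
}

# Escape backslashes and double quotes so a single-line text is safe
# inside a JSON string (newlines are not handled).
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# POST the text to a Slack incoming webhook.
# SLACK_WEBHOOK_URL is an assumed environment variable.
post_to_slack() {
    local text
    text=$(json_escape "$1")
    curl -s -X POST -H 'Content-Type: application/json' \
        --data "{\"text\": \"${text}\"}" "$SLACK_WEBHOOK_URL"
}

# Only act when called with at least two arguments.
if [ "$#" -ge 2 ]; then
    post_to_slack "$(format_expel_message "$(hostname -s)" "$1" "$2")"
fi
```

Keeping the message formatting separate from the HTTP call makes it easy to try the formatting by hand before wiring the script into /var/mmfs/etc/.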


Bob Oesterlin
Sr Principal Storage Engineer, Nuance


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

2019-01-31 Thread Marc A Kaplan
Various "leave" / "join" events may be interesting ... But you've got to 
consider that an abrupt failure of several nodes is not necessarily 
recorded anywhere! For example, the would-be recording devices 
might all lose power at the same time.




Re: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

2019-01-31 Thread Buterbaugh, Kevin L
Hi Bob,

We use the nodeLeave callback to detect node expels … for what you’re wanting 
to do I wonder if nodeJoin might work??  If a node joins the cluster and then 
has an uptime of a few minutes you could go looking in /tmp/mmfs.  HTH...
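
One way the "go looking in /tmp/mmfs" step could be scripted is sketched below. It assumes the script runs on the node that just joined, that GNU find is available, and that "recent" means modified within the last 10 minutes; the function names, the node-name argument and the time window are all illustrative choices, not anything GPFS defines.

```shell
#!/bin/bash
# Hypothetical helper: report files that recently appeared in the dump directory.

# List files under a directory that were modified within the last N minutes.
recent_dumps() {
    local dir="${1:-/tmp/mmfs}" minutes="${2:-10}"
    [ -d "$dir" ] || return 0
    find "$dir" -type f -mmin "-${minutes}" 2>/dev/null
}

# Print a report only when something recent is found; stay silent otherwise.
report_recent_dumps() {
    local node="$1" dir="${2:-/tmp/mmfs}" minutes="${3:-10}" files
    files=$(recent_dumps "$dir" "$minutes")
    if [ -n "$files" ]; then
        printf '%s: recent files in %s:\n%s\n' "$node" "$dir" "$files"
    fi
}

# Usage: script NODE [DIR] [MINUTES]
if [ "$#" -ge 1 ]; then
    report_recent_dumps "$@"
fi
```

Because the script prints nothing when the directory is empty or stale, its output can be piped straight into whatever alerting is already in place.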

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
kevin.buterba...@vanderbilt.edu - 
(615)875-9633



Re: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

2019-01-30 Thread Oesterlin, Robert
Actually, I think “preShutdown” will do it, since it passes the type of shutdown 
(“abnormal” for a crash) to the callback - I can use that to send a Slack 
message.

mmaddcallback node-abort --event preShutdown \
    --command /usr/local/sbin/callback-test.sh --parms "%eventName %reason"

and you get either:

preShutdown normal
preShutdown abnormal
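
For illustration, a callback script consuming those two values might look roughly like this. It is a sketch under the assumption that the event name and reason arrive as the first and second arguments; the message wording, the function name and the use of logger (rather than a Slack post) are placeholders.

```shell
#!/bin/bash
# Hypothetical body for a preShutdown callback script.
# Assumes $1 is the event name and $2 is the reason, as implied by
# --parms "%eventName %reason".

# Turn the event and reason into a one-line message.
shutdown_message() {
    local event="$1" reason="$2" host="${3:-$(hostname -s)}"
    case "$reason" in
        abnormal) printf 'ALERT: %s on %s: GPFS shut down abnormally' "$event" "$host" ;;
        normal)   printf 'INFO: %s on %s: normal GPFS shutdown' "$event" "$host" ;;
        *)        printf 'WARN: %s on %s: unexpected reason "%s"' "$event" "$host" "$reason" ;;
    esac
}

# Only act when called with both arguments; send the line to syslog.
if [ "$#" -ge 2 ]; then
    shutdown_message "$1" "$2" | logger -t gpfs-callback
fi
```

The catch-all branch is there so that an unexpected reason value shows up in the log instead of being silently dropped.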


Bob Oesterlin
Sr Principal Storage Engineer, Nuance




Re: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

2019-01-30 Thread Dwayne.Hart
Could you get away with running “mmdiag --stats” and inspecting the uptime 
information it provides?

Best,
Dwayne
--
Dwayne Hart | Systems Administrator IV

CHIA, Faculty of Medicine
Memorial University of Newfoundland
300 Prince Philip Drive
St. John’s, Newfoundland | A1B 3V6
Craig L Dobbin Building | 4M409
T 709 864 6631



Re: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

2019-01-30 Thread Marc A Kaplan
We have (pre)shutdown and (pre)startup ...
Trap and record both ... If you see a startup without a matching shutdown, 
you know the shutdown never happened, because GPFS crashed.
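
A minimal sketch of that bookkeeping, assuming one script is registered for both events and receives the event name as its first argument. The state file path, the function name and the "crash-suspected" marker are invented for the example. Note that a power loss or host crash will look the same as a GPFS crash here, since neither leaves a shutdown record.

```shell
#!/bin/bash
# Hypothetical startup/shutdown matcher: remembers the last event seen.

# Where the last event is remembered (hypothetical location).
STATE_FILE="${STATE_FILE:-/var/tmp/gpfs-last-event}"

# Record the event; print "crash-suspected" when a startup follows another
# startup with no shutdown recorded in between, otherwise print "ok".
record_event() {
    local event="$1" last=""
    if [ -f "$STATE_FILE" ]; then
        last=$(cat "$STATE_FILE")
    fi
    printf '%s\n' "$event" > "$STATE_FILE"
    if [ "$event" = "preStartup" ] && [ "$last" = "preStartup" ]; then
        echo "crash-suspected"
    else
        echo "ok"
    fi
}

# Usage: script EVENTNAME
if [ "$#" -ge 1 ]; then
    record_event "$1"
fi
```

The very first startup after installing the script has nothing to compare against, so it is reported as "ok".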



Re: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

2019-01-30 Thread Sanchez, Paul
There are some cases which I don’t believe can be caught with callbacks (e.g. 
DMS = Dead Man Switch).  But you could possibly use preStartup to check the 
host uptime and infer whether GPFS was restarted long after the host 
booted.  You could also peek in /tmp/mmfs and only report if you find something 
there.  That said, the docs say that preStartup fires after the node joins the 
cluster.  So if that means once the node is ‘active’, then you might miss 
nodes stuck in ‘arbitrating’ for a while due to a waiter problem.
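
The uptime check could be sketched along these lines. It assumes a Linux host (it reads /proc/uptime) and an arbitrary 600-second threshold for "long after boot"; the function names and the threshold are illustrative only.

```shell
#!/bin/bash
# Hypothetical preStartup helper: is GPFS starting long after the host booted?

# Whole seconds since boot, read from /proc/uptime (Linux only).
host_uptime_seconds() {
    local up
    read -r up _ < /proc/uptime
    printf '%s\n' "${up%%.*}"
}

# Succeeds when the uptime exceeds the threshold, which suggests GPFS was
# restarted on a running host rather than started as part of a boot.
started_long_after_boot() {
    local uptime_s="$1" threshold_s="${2:-600}"
    [ "$uptime_s" -gt "$threshold_s" ]
}

# Usage: script THRESHOLD_SECONDS
if [ "$#" -ge 1 ]; then
    up=$(host_uptime_seconds)
    if started_long_after_boot "$up" "$1"; then
        echo "GPFS starting ${up}s after boot; check /tmp/mmfs" | logger -t gpfs-callback
    fi
fi
```

A threshold that is too low will flag slow boots, and one that is too high will miss a crash shortly after boot, so the value needs tuning per site.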

We run a script with cron which monitors the myriad things which can go wrong, 
attempts to right those which are safe to fix, and raises alerts 
appropriately.  Something like that, outside the reach of GPFS, is often a good 
choice if you don’t need to know something the moment it happens.

Thx
Paul

From: gpfsug-discuss-boun...@spectrumscale.org 
 On Behalf Of Oesterlin, Robert
Sent: Wednesday, January 30, 2019 3:52 PM
To: gpfsug main discussion list 
Subject: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

Anyone crafted a good way to detect a node ‘crash and restart’ event using GPFS 
callbacks? I’m thinking “preShutdown” but I’m not sure if that’s the best. What 
I’m really looking for is: did the node shut down (abort) and create a dump in 
/tmp/mmfs?


Bob Oesterlin
Sr Principal Storage Engineer, Nuance
