You could forcibly expel the node (one of my favorite GPFS commands):

mmexpelnode -N $nodename

then power it off once the expulsion is complete, and run

mmexpelnode -r -N $nodename

which will allow it to rejoin the cluster the next time you start GPFS on it. You'll still likely have to go through recovery, but you'll skip the part where GPFS wonders where the node went before finally expelling it.
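
Putting it together, the whole dance is just something like this (the node name is a placeholder, and how you power the box off, via its BMC, a PDU, etc., depends on your environment):

# expel the sick node from the cluster
mmexpelnode -N badnode01

# ...power the node off out-of-band once the expulsion completes...

# clear the expelled state so the node can rejoin at the next GPFS startup
mmexpelnode -r -N badnode01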

-Aaron

On 2/2/17 2:28 PM, [email protected] wrote:
On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said:

but the /var/mmfs directory is obviously damaged or empty, whatever the cause; that's why you
see a message like this.
Have you reinstalled that node, or restored it from a backup?

The internal RAID controller died a horrid death and basically took
all the OS partitions with it.  So the node was just sort of limping along,
where the mmfsd process was still coping because it wasn't doing any
I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work
because that requires accessing stuff in /var.

At that point, it starts getting tempting to just use ipmitool from
another node to power the comatose one down - but that often causes
a cascade of other issues while things are stuck waiting for timeouts.
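
For the record, the ipmitool incantation is something along these lines (the BMC hostname and credentials here are placeholders):

# hard power-off via the dead node's BMC, issued from a healthy node
ipmitool -I lanplus -H bad-node-bmc -U admin -P secret chassis power off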



--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776