Hmm... have you dumped waiters across the entire cluster, or just on the NSD 
servers / FS managers? Maybe there’s a slow node out there participating in the 
suspend effort? Might be worth running some quick tracing on the FS manager to 
see what it’s up to.
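
Roughly what I have in mind (untested, and “managernode” below is just a 
placeholder for whatever mmlsmgr reports):

    # waiters from every node in the cluster, not just the NSD servers
    mmdsh -N all /usr/lpp/mmfs/bin/mmdiag --waiters

    # which node is currently the gpfs22 file system manager?
    mmlsmgr gpfs22

    # short trace window on that manager node while the mmchdisk sits there
    mmtracectl --start -N managernode
    sleep 60
    mmtracectl --stop -N managernode

Even one long waiter on an otherwise idle client could be enough to stall the 
suspend.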

On July 15, 2018 at 13:27:54 EDT, Buterbaugh, Kevin L 
<[email protected]> wrote:
Hi All,

We are in a partial cluster downtime today to do firmware upgrades on our 
storage arrays.  It is a partial downtime because we have two GPFS filesystems:

1.  gpfs23 - 900+ TB, corresponding to /scratch and /data, which I’ve unmounted 
across the cluster because it has data replication set to 1.

2.  gpfs22 - 42 TB, corresponding to /home.  It has data replication set to 2, 
so what we’re doing is “mmchdisk gpfs22 suspend -d <the gpfs22 NSD>”, then doing 
the firmware upgrade, and once the array is back we’re doing a “mmchdisk gpfs22 
resume -d <NSD>”, followed by “mmchdisk gpfs22 start -d <NSD>”.

On the 1st storage array this went very smoothly … the mmchdisk took about 5 
minutes, which is what I would expect.

But on the 2nd storage array the mmchdisk appears to either be hung or 
proceeding at a glacial pace.  For more than an hour it’s been stuck at:

mmchdisk: Processing continues ...
Scanning file system metadata, phase 1 ...

There are no waiters of any significance and “mmdiag --iohist” doesn’t show any 
issues either.

Any ideas, anyone?  Unless I can figure this out I’m hosed for this downtime, 
as I’ve got 7 more arrays to do after this one!

Thanks!

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected] - (615)875-9633



_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
