Hmm... have you dumped waiters across the entire cluster, or just on the NSD servers / FS managers? Maybe there's a slow node out there participating in the suspend effort. It might be worth running some quick tracing on the FS manager to see what it's up to.
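Something like the following is roughly what I had in mind. It's a sketch from memory, so adjust node names and paths for your cluster (the node name fsmgr-node below is a placeholder, and mmdsh is the undocumented cluster-wide shell that ships under /usr/lpp/mmfs/bin):

    # Find out which node is currently the gpfs22 filesystem manager
    mmlsmgr gpfs22

    # Dump waiters on every node in the cluster, not just the NSD servers
    mmdsh -N all "/usr/lpp/mmfs/bin/mmdiag --waiters" | sort | less

    # Grab a short trace on the FS manager node while the mmchdisk is stuck
    # (replace fsmgr-node with whatever mmlsmgr reported)
    mmtracectl --start -N fsmgr-node
    sleep 60
    mmtracectl --stop -N fsmgr-node
    # formatted trace output lands under /tmp/mmfs by default;
    # run "mmtracectl --off -N fsmgr-node" when you're done

If every node is quiet in the waiters output, the trace on the FS manager is usually the quickest way to see whether the metadata scan is actually making progress or is blocked on something.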
On July 15, 2018 at 13:27:54 EDT, Buterbaugh, Kevin L <[email protected]> wrote:

Hi All,

We are in a partial cluster downtime today to do firmware upgrades on our storage arrays. It is a partial downtime because we have two GPFS filesystems:

1. gpfs23 - 900+ TB, corresponding to /scratch and /data, which I've unmounted across the cluster because it has data replication set to 1.

2. gpfs22 - 42 TB, corresponding to /home. It has data replication set to two, so what we're doing is "mmchdisk gpfs22 suspend -d <the gpfs22 NSD>", then doing the firmware upgrade, and once the array is back we're doing a "mmchdisk gpfs22 resume -d <NSD>", followed by "mmchdisk gpfs22 start -d <NSD>".

On the 1st storage array this went very smoothly ... the mmchdisk took about 5 minutes, which is what I would expect. But on the 2nd storage array the mmchdisk appears to either be hung or proceeding at a glacial pace. For more than an hour it's been stuck at:

    mmchdisk: Processing continues ...
    Scanning file system metadata, phase 1 ...

There are no waiters of any significance and "mmdiag --iohist" doesn't show any issues either.

Any ideas, anyone? Unless I can figure this out I'm hosed for this downtime, as I've got 7 more arrays to do after this one!

Thanks!

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected] - (615)875-9633
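For anyone finding this thread in the archives, the per-array sequence Kevin describes would look roughly like this (the NSD names are placeholders; the quoting and semicolon-separated disk list follow the mmchdisk man page):

    # Stop new allocations on the NSDs that live on the array being upgraded
    mmchdisk gpfs22 suspend -d "nsd22a;nsd22b"

    # ... perform the storage array firmware upgrade ...

    # Allow allocations again, then bring the disks up and re-replicate
    # anything written to the surviving replica while they were suspended
    mmchdisk gpfs22 resume -d "nsd22a;nsd22b"
    mmchdisk gpfs22 start -d "nsd22a;nsd22b"

    # List any disks not yet "ready / up" before moving on to the next array
    mmlsdisk gpfs22 -e

The "start" step is the one that triggers the metadata scan mentioned above, which is why a hang there blocks the whole rolling upgrade.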
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
