Kevin,

    While this is happening, are you able to grab latency stats per LUN
(hardware vendor agnostic) to see if there are any outliers? Also, when looking
at the mmdiag output, are both reads and writes affected? Depending on the
storage hardware, your writes might be hitting cache, so this problem may be
exacerbated by many small reads (ones too random to be coalesced, to take
advantage of drive NCQ, etc.).
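
For a quick vendor-agnostic view of per-LUN latency, iostat from the sysstat
package on each NSD server is a reasonable starting point (a sketch; column
names vary by sysstat version, with r_await/w_await being the per-device read
and write latencies in milliseconds on recent releases):

    # extended per-device stats, megabytes, 5-second samples
    iostat -xm 5

One LUN with await values consistently worse than its peers on the same array
would be a good candidate for a slow drive or a path problem.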

    The other response about the NSD threads is also a good start, but if the
I/O waits shift between different NSD servers and across hardware vendors, my
assumption would be that you are hitting a bottleneck somewhere and that what
you are seeing are symptoms of an I/O backlog, which can manifest in any number
of places. This could be something as low-level as a few slow drives.
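
One way to sanity-check the backlog theory while it's happening is to watch
the waiters on the NSD servers (a sketch; the exact waiter text varies by
Spectrum Scale release):

    # longest waiters first; look for threads waiting on I/O completion
    # stacking up behind particular disks
    mmdiag --waiters

If the long waiters pile up behind a handful of devices and that set changes
over time, that's consistent with a shared bottleneck rather than a single
bad component.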

    Have you just started noticing this behavior? Any new applications on your
system? Going by your institution, you're probably supporting a wide variety of
codes, so if these problems just started happening, it's possible that someone
changed their code or decided to run new scientific packages.
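
If you want a rough read on whether small random reads dominate, a quick
histogram over the iohist output can help. This is only a sketch -- the field
positions in "mmdiag --iohist" output vary by release, so check your column
layout for the R/W flag, size-in-sectors, and time-in-ms fields and adjust
the awk accordingly:

    # hypothetical field positions: $2 = R/W, $5 = nSec, $6 = time (ms)
    mmdiag --iohist | awk '$2 == "R" { n++; sec += $5; ms += $6 }
        END { if (n) printf "reads: %d  avg %.0f sectors  avg %.1f ms\n",
                            n, sec/n, ms/n }'

Lots of reads with a small average size and a high average wait would point at
the random-small-read pattern rather than a raw throughput problem.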

-Steve
________________________________________
From: [email protected] 
[[email protected]] on behalf of Buterbaugh, Kevin L 
[[email protected]]
Sent: Tuesday, July 03, 2018 11:43 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] High I/O wait times

Hi all,

We are experiencing some high I/O wait times (5 - 20 seconds!) on some of our 
NSDs as reported by "mmdiag --iohist" and are struggling to understand why.  One
of the confusing things is that, while certain NSDs tend to show the problem 
more than others, the problem is not consistent … i.e. the problem tends to 
move around from NSD to NSD (and storage array to storage array) whenever we 
check … which is sometimes just a few minutes apart.

In the past when I have seen "mmdiag --iohist" report high wait times like this
it has *always* been hardware related.  In our environment, the most common 
cause has been a battery backup unit on a storage array controller going bad 
and the storage array switching to write straight to disk.  But that’s *not* 
happening this time.

Is there anything within GPFS / outside of a hardware issue that I should be 
looking for??  Thanks!

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected]<mailto:[email protected]> - 
(615)875-9633



_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
