Are all of the slow IOs from the same NSD volumes?    

You could run an mmtrace, take an internal dump, and open a ticket with the
Spectrum Scale queue.  You may want to limit the run to just your NSD servers
rather than all nodes, as I use in my example.  Alternatively, one of the tools
we use to review a trace is available in
/usr/lpp/mmfs/samples/debugtools/trsum.awk; you can run it against the
uncompressed trace file and redirect standard output to a file.  If you search
for ' total ' in that output you will find the different sections, or you can
just run grep ' total IO ' trsum.out | grep duration to get a quick look per
LUN.

mmtracectl --set --trace=def --tracedev-write-mode=overwrite \
    --tracedev-overwrite-buffer-size=500M -N all
mmtracectl --start -N all ; sleep 30 ; mmtracectl --stop -N all ; mmtracectl --off -N all
mmdsh -N all "/usr/lpp/mmfs/bin/mmfsadm dump all"


On Thursday, February 21, 2019, 7:23:46 AM EST, Frederick Stock <> wrote:

Kevin, I'm assuming you have seen the article on IBM developerWorks about the
GPFS NSD queues.  It provides useful background for analyzing the dump nsd
information.  Here I'll list some thoughts for items that you can
investigate/consider.

If your NSD servers are doing both large (greater than 64K) and small (64K or
less) IOs then you want to have nsdSmallThreadRatio set to 1, as it seems you
do for the NSD servers.  This provides an equal number of SMALL and LARGE NSD
queues.  You can also increase the total number of queues (currently 256) but I
cannot determine if that is necessary from the data you provided.  Only on rare
occasions have I seen a need to increase the number of queues.

The fact that you have 71 highest pending on your LARGE queues and 73 highest
pending on your SMALL queues would imply your IOs are queueing for a good
while, either waiting for resources in GPFS or waiting for IOs to complete.

Your maximum buffer size is 16M, which is defined to be the largest IO that can
be requested by GPFS.  This is the buffer size that GPFS will use for LARGE
IOs.  You indicated you had sufficient memory on the NSD servers, but what is
the value for the pagepool on those servers, and what is the value of the
nsdBufSpace parameter?  If the NSD server is just that then usually nsdBufSpace
is set to 70.  The IO buffers used by the NSD server come from the pagepool, so
you need sufficient space there for the maximum number of LARGE IO buffers that
would be used concurrently by GPFS, or threads will need to wait for those
buffers to become available.  Essentially you want to ensure you have
sufficient memory for the maximum number of IOs all doing a large IO, and that
value should be less than 70% of the pagepool size.

You could look at the settings for the FC cards to ensure they are configured
to do the largest IOs possible.  I forget the actual values (have not done this
for a while) but there are settings for the adapters that control the maximum
IO size that will be sent.  I think you want this to be as large as the adapter
can handle to reduce the number of messages needed to complete the large IOs
done by GPFS.

Fred
Fred Stock | IBM Pittsburgh Lab | 720-430-8821  
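
The pagepool check described above comes down to simple arithmetic. Here is a
minimal Python sketch of it, not an official sizing tool. The 16M buffer size
and the 70% nsdBufSpace share come from this thread; the function names, the
use of nsdMaxWorkerThreads (1024) as a worst-case count of concurrent LARGE
IOs, and the 32 GiB pagepool are assumptions for illustration only (the actual
pagepool value was not given).

```python
# Rough worst-case check: do the LARGE I/O buffers fit in the share of the
# pagepool that nsdBufSpace reserves for NSD server buffers?
# All helper names here are hypothetical, not part of any GPFS tooling.

GIB = 1024 ** 3


def nsd_buffer_budget(pagepool_bytes, nsd_buf_space_pct=70):
    """Bytes of pagepool available for NSD I/O buffers."""
    return pagepool_bytes * nsd_buf_space_pct // 100


def large_io_demand(concurrent_large_ios, big_buffer_bytes=16777216):
    """Bytes needed if every concurrent I/O uses a full-size LARGE buffer."""
    return concurrent_large_ios * big_buffer_bytes


def min_pagepool(concurrent_large_ios, big_buffer_bytes=16777216,
                 nsd_buf_space_pct=70):
    """Smallest pagepool (bytes) whose NSD share covers the worst case."""
    demand = large_io_demand(concurrent_large_ios, big_buffer_bytes)
    # Ceiling division so the result is never rounded below the requirement.
    return (demand * 100 + nsd_buf_space_pct - 1) // nsd_buf_space_pct


if __name__ == "__main__":
    threads = 1024           # assumed upper bound: nsdMaxWorkerThreads
    pagepool = 32 * GIB      # hypothetical pagepool size
    demand = large_io_demand(threads)
    budget = nsd_buffer_budget(pagepool)
    print(f"worst-case demand: {demand / GIB:.2f} GiB")
    print(f"NSD buffer budget: {budget / GIB:.2f} GiB")
    print(f"minimum pagepool:  {min_pagepool(threads) / GIB:.2f} GiB")
    print("fits" if demand <= budget else "threads may wait for buffers")
```

Using every worker thread as a concurrent LARGE IO deliberately overestimates
the demand; with nsdSmallThreadRatio set to 1 the real figure is likely lower,
so treat the result as an upper bound.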
----- Original message -----
From: "Buterbaugh, Kevin L" <>
Sent by:
To: gpfsug main discussion list <>
Subject: Re: [gpfsug-discuss] Clarification of mmdiag --iohist output
Date: Thu, Feb 21, 2019 6:39 AM
Hi All,

My thanks to Aaron, Sven, Steve, and whoever responded for the GPFS team.  You
confirmed what I suspected … my example 10 second I/O was _from an NSD server_
… and since we’re in a 8 Gb FC SAN environment, it therefore means - correct me
if I’m wrong about this someone - that I’ve got a problem somewhere in one (or
more) of the following 3 components:

1) the NSD servers
2) the SAN fabric
3) the storage arrays

I’ve been looking at all of the above and none of them are showing any obvious
problems.  I’ve actually got a techie from the storage array vendor stopping by
on Thursday, so I’ll see if he can spot anything there.  Our FC switches are
QLogic’s, so I’m kinda screwed there in terms of getting any help.  But I don’t
see any errors in the switch logs and “show perf” on the switches is showing
I/O rates of 50-100 MB/sec on the in use ports, so I don’t _think_ that’s the
issue.

And this is the GPFS mailing list, after all … so let’s talk about the NSD
servers.  Neither memory (64 GB) nor CPU (2 x quad-core Intel Xeon E5620’s)
appear to be an issue.  But I have been looking at the output of “mmfsadm
saferdump nsd” based on what Aaron and then Steve said.  Here’s some fairly
typical output from one of the SMALL queues (I’ve checked several of my 8 NSD
servers and they’re all showing similar output):

    Queue NSD type NsdQueueTraditional [244]: SMALL, threads started 12, active 3, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0
     requests pending 0, highest pending 73, total processed 4859732
     mutex 0x7F3E449B8F10, reqCond 0x7F3E449B8F58, thCond 0x7F3E449B8F98, queue 0x7F3E449B8EF0, nFreeNsdRequests 29

And for a LARGE queue:

    Queue NSD type NsdQueueTraditional [8]: LARGE, threads started 12, active 1, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0
     requests pending 0, highest pending 71, total processed 2332966
     mutex 0x7F3E441F3890, reqCond 0x7F3E441F38D8, thCond 0x7F3E441F3918, queue 0x7F3E441F3870, nFreeNsdRequests 31

So my large queues seem to be slightly less utilized than my small queues
overall … i.e. I see more inactive large queues and they generally have a
smaller “highest pending” value.

Question:  are those non-zero “highest pending” values something to be
concerned about?

I have the following thread-related parameters set:

[common]
maxReceiverThreads 12
nsdMaxWorkerThreads 640
nsdThreadsPerQueue 4
nsdSmallThreadRatio 3
workerThreads 128

[serverLicense]
nsdMaxWorkerThreads 1024
nsdThreadsPerQueue 12
nsdSmallThreadRatio 1
pitWorkerThreadsPerNode 3
workerThreads 1024

Also, at the top of the “mmfsadm saferdump nsd” output I see:

Total server worker threads: running 1008, desired 147, forNSD 147, forGNR 0, nsdBigBufferSize 16777216
nsdMultiQueue: 256, nsdMultiQueueType: 1, nsdMinWorkerThreads: 16, nsdMaxWorkerThreads: 1024

Question:  is the fact that 1008 is pretty close to 1024 a concern?

Anything jump out at anybody?  I don’t mind sharing full output, but it is
rather lengthy.  Is this worthy of a PMR?

Thanks!

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
(615)875-9633

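
Since the full “mmfsadm saferdump nsd” output is lengthy, the per-queue
“highest pending” values quoted above can be pulled out with a short script.
This is a minimal Python sketch that assumes every queue entry follows the
same layout as the two samples in this message. The regular expression and the
function names are illustrative, not part of any GPFS tooling, and other
releases may print the dump differently.

```python
import re

# Matches one queue entry as quoted in this thread. Whitespace is collapsed
# first, so it does not matter how the dump lines were wrapped.
QUEUE_RE = re.compile(
    r"Queue NSD type (?P<type>\w+) \[(?P<index>\d+)\]: (?P<size>SMALL|LARGE), "
    r"threads started (?P<started>\d+), active (?P<active>\d+), "
    r"highest (?P<highest>\d+),.*?"
    r"requests pending (?P<pending>\d+), "
    r"highest pending (?P<highest_pending>\d+), "
    r"total processed (?P<processed>\d+)"
)

INT_FIELDS = ("index", "started", "active", "highest",
              "pending", "highest_pending", "processed")


def parse_nsd_queues(dump_text):
    """Return one dict per queue entry found in the dump text."""
    text = " ".join(dump_text.split())  # collapse line wrapping and padding
    queues = []
    for match in QUEUE_RE.finditer(text):
        entry = match.groupdict()
        for name in INT_FIELDS:
            entry[name] = int(entry[name])
        queues.append(entry)
    return queues


def max_highest_pending(queues):
    """Largest 'highest pending' value seen per queue size class."""
    result = {}
    for q in queues:
        result[q["size"]] = max(result.get(q["size"], 0),
                                q["highest_pending"])
    return result


# The two sample entries quoted in this message.
SAMPLE = """
Queue NSD type NsdQueueTraditional [244]: SMALL, threads started 12,
active 3, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0
 requests pending 0, highest pending 73, total processed 4859732
Queue NSD type NsdQueueTraditional [8]: LARGE, threads started 12,
active 1, highest 12, deferred 0, chgSize 0, draining 0, is_chg 0
 requests pending 0, highest pending 71, total processed 2332966
"""

if __name__ == "__main__":
    for size, value in sorted(max_highest_pending(parse_nsd_queues(SAMPLE)).items()):
        print(size, value)
```

Feeding it the saved dump from each NSD server would make it easier to compare
servers without reading all 256 queue entries by hand.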
On Feb 17, 2019, at 1:01 PM, IBM Spectrum Scale <> wrote:

Hi,

The I/O history shown by mmdiag --iohist depends on the node on which you run
the command.
If you run it on an NSD server node, it shows the time taken to complete/serve
the read or write I/O operation sent from the client node.
If you run it on a client (or non-NSD-server) node, it shows the complete time
taken by the read or write I/O operation requested by the client node.
So in a nutshell, for the NSD server case it is just the latency of the I/O
done on disk by the server, whereas for the NSD client case it also includes
the latency of sending and receiving the I/O request to and from the NSD
server, along with the latency of the I/O done on disk by the NSD server.
I hope this answers your query.

Regards, The Spectrum Scale (GPFS) team

If you feel that your question can benefit other users of Spectrum Scale
(GPFS), then please post it to the public IBM developerWorks Forum.

If your query concerns a potential software error in Spectrum Scale (GPFS) and
you have an IBM software maintenance contract, please contact 1-800-237-5511 in
the United States or your local IBM Service Center in other countries.

The forum is informally monitored as time permits and should not be used for 
priority messages to the Spectrum Scale (GPFS) team.

From:        "Buterbaugh, Kevin L" <>
To:        gpfsug main discussion list <>
Date:        02/16/2019 08:18 PM
Subject:        [gpfsug-discuss] Clarification of mmdiag --iohist output
Sent by:

Hi All, 

Been reading man pages, docs, and Googling, and haven’t found a definitive 
answer to this question, so I knew exactly where to turn… ;-)

I’m dealing with some slow I/O’s to certain storage arrays in our environments 
… like really, really slow I/O’s … here’s just one example from one of my NSD 
servers of a 10 second I/O:

08:49:34.943186  W        data   30:41615622144   2048 10115.192  srv   dm-92   <client IP redacted>

So here’s my question … when mmdiag --iohist tells me that that I/O took 
slightly over 10 seconds, is that:

1.  The time from when the NSD server received the I/O request from the client 
until it shipped the data back onto the wire towards the client?
2.  The time from when the client issued the I/O request until it received the 
data back from the NSD server?
3.  Something else?

I’m thinking it’s #1, but want to confirm.  Which one it is has very obvious 
implications for our troubleshooting steps.  Thanks in advance…
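
To pick out entries like the one above from a longer mmdiag --iohist capture,
a small filter along these lines may help. It is a minimal Python sketch: the
column meanings (start time, R/W, buffer type, disk:sector, sector count, time
in ms, type, device, client node) are assumed from the sample line and the
usual iohist header, and the names IoHistEntry, parse_iohist_line and slow_ios
are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class IoHistEntry(NamedTuple):
    start: str        # I/O start time, e.g. 08:49:34.943186
    rw: str           # R or W
    buf_type: str     # e.g. data
    disk: int         # disk number (left of the colon)
    sector: int       # starting sector (right of the colon)
    n_sectors: int    # I/O length in sectors
    time_ms: float    # elapsed time in milliseconds
    io_type: str      # e.g. srv
    device: str       # e.g. dm-92
    node: str         # client node column, may be empty


def parse_iohist_line(line: str) -> Optional[IoHistEntry]:
    """Parse one iohist data line; return None for headers or other text."""
    fields = line.strip().split(None, 8)  # keep the last column intact
    if len(fields) < 8:
        return None
    try:
        disk, _, sector = fields[3].partition(":")
        return IoHistEntry(
            start=fields[0], rw=fields[1], buf_type=fields[2],
            disk=int(disk), sector=int(sector),
            n_sectors=int(fields[4]), time_ms=float(fields[5]),
            io_type=fields[6], device=fields[7],
            node=fields[8] if len(fields) > 8 else "",
        )
    except ValueError:
        return None  # not a data line (e.g. the column header)


def slow_ios(lines: Iterable[str],
             threshold_ms: float = 1000.0) -> Iterator[IoHistEntry]:
    """Yield entries whose elapsed time is at or above the threshold."""
    for line in lines:
        entry = parse_iohist_line(line)
        if entry is not None and entry.time_ms >= threshold_ms:
            yield entry


SAMPLE_LINE = ("08:49:34.943186  W        data   30:41615622144   2048 "
               "10115.192  srv   dm-92   <client IP redacted>")

if __name__ == "__main__":
    for e in slow_ios([SAMPLE_LINE]):
        print(e.start, e.device, e.time_ms)
```

Grouping the slow entries by the device column would show quickly whether they
cluster on particular LUNs.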

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education (615)875-9633
gpfsug-discuss mailing list
gpfsug-discuss at
