Re: [gpfsug-discuss] Monitor NSD server queue?

Oesterlin, Robert Thu, 18 Aug 2016 07:48:06 -0700

Done.

Notification generated at: 18 Aug 2016, 10:46 AM Eastern Time (ET)


ID:            93260
Headline:      Give sysadmin insight into the inner workings of the NSD server
               machinery, in particular the queue dynamics
Submitted on:  18 Aug 2016, 10:46 AM Eastern Time (ET)
Brand:         Servers and Systems Software
Product:       Spectrum Scale (formerly known as GPFS) - Public RFEs

Link:          http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=93260


Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid
507-269-0413


From: <[email protected]> on behalf of Yuri L Volobuev 
<[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Wednesday, August 17, 2016 at 3:34 PM
To: gpfsug main discussion list <[email protected]>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Monitor NSD server queue?


Unfortunately, at the moment there's no safe mechanism to show the usage 
statistics for different NSD queues. "mmfsadm saferdump nsd" as implemented 
doesn't acquire locks when parsing internal data structures. Now, NSD data 
structures are fairly static, as such things go, so the risk of following a 
stale pointer and hitting a segfault isn't particularly significant. I don't 
think I remember ever seeing mmfsd crash with NSD dump code on the stack. That 
said, this isn't code that's tested and known to be safe for production use. I 
haven't seen a case myself where an mmfsd thread gets stuck running this dump 
command, either, but Bob has. If that condition ever reoccurs, I'd be 
interested in seeing debug data.

I agree that there's value in giving a sysadmin insight into the inner workings 
of the NSD server machinery, in particular the queue dynamics. mmdiag should be 
enhanced to allow this. That'd be a very reasonable (and doable) RFE.

yuri


From: "Oesterlin, Robert" <[email protected]>
To: gpfsug main discussion list <[email protected]>,
Date: 08/17/2016 04:45 AM
Subject: Re: [gpfsug-discuss] Monitor NSD server queue?
Sent by: [email protected]

________________________________



Hi Aaron

You did a perfect job of explaining a situation I've run into time after time - 
high latency on the disk subsystem causing a backup in the NSD queues. I was 
doing what you suggested not to do - "mmfsadm saferdump nsd" and looking at the 
queues. In my case "mmfsadm saferdump" would usually work or hang, rather than 
kill mmfsd. But - the hang usually resulted in a tied-up thread in mmfsd, so 
that's no good either.

I wish I had better news - this is the only way I've found to get visibility into 
these queues. IBM hasn't seen fit to give us a way to safely look at these. I 
personally think it's a bug that we can't safely dump these structures, as they 
give insight as to what's actually going on inside the NSD server.

Yuri, Sven - thoughts?


Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid



From: <[email protected]> on behalf of "Knister, Aaron 
S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" <[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Tuesday, August 16, 2016 at 8:46 PM
To: gpfsug main discussion list <[email protected]>
Subject: [EXTERNAL] [gpfsug-discuss] Monitor NSD server queue?

Hi Everyone,

We ran into a rather interesting situation over the past week. We had a job 
that was pounding the ever loving crap out of one of our filesystems (called 
dnb02) doing about 15GB/s of reads. We had other jobs experience a slowdown on 
a different filesystem (called dnb41) that uses entirely separate backend 
storage. What I can't figure out is why this other filesystem was affected. 
I've checked IB bandwidth and congestion, Fibre Channel bandwidth and errors, 
Ethernet bandwidth and congestion, looked at the mmpmon nsd_ds counters (including 
disk request wait time), and checked out the disk iowait values from collectl. 
I simply can't account for the slowdown on the other filesystem. The only thing 
I can think of is the high latency on dnb02's NSDs caused the mmfsd NSD queues 
to back up.

Here's my question-- how can I monitor the state of the NSD queues? I can't find 
anything in mmdiag. An mmfsadm saferdump NSD shows me the queues and their 
status. I'm just not sure calling saferdump NSD every 10 seconds to monitor 
this data is going to end well. I've seen saferdump NSD cause mmfsd to die and 
that's from a task we only run every 6 hours that calls saferdump NSD.
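
For illustration only, here is a minimal Python sketch of what a cautious, low-frequency poll of the dump command could look like. It is not a tested or recommended monitoring method. The install path /usr/lpp/mmfs/bin/mmfsadm, the helper names run_with_timeout and poll, and the interval and timeout values are all assumptions. The sketch does not parse the dump output, since its format is not described in this thread. It only bounds how long the caller waits and stops polling at the first sign of trouble.

```python
import subprocess
import sys
import time

# Assumption: common install path for mmfsadm; adjust for your site.
DUMP_CMD = ["/usr/lpp/mmfs/bin/mmfsadm", "saferdump", "nsd"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, stdout).

    status is one of 'ok', 'failed', 'timeout', or 'missing'.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process on timeout. That does NOT
        # release a thread that may be stuck inside the daemon itself.
        return "timeout", ""
    except FileNotFoundError:
        return "missing", ""
    return ("ok" if proc.returncode == 0 else "failed"), proc.stdout


def poll(cmd, interval_s, timeout_s, max_polls):
    """Poll up to max_polls times; stop at the first result that is not 'ok'."""
    results = []
    for i in range(max_polls):
        status, out = run_with_timeout(cmd, timeout_s)
        results.append((status, len(out)))
        if status != "ok":
            break  # do not keep hammering a daemon that may be in trouble
        if i + 1 < max_polls:
            time.sleep(interval_s)
    return results


if __name__ == "__main__":
    # Hypothetical values: one dump every 10 minutes, 30 s wait per dump.
    for status, size in poll(DUMP_CMD, interval_s=600, timeout_s=30, max_polls=6):
        print(status, size)
```

The timeout protects only the monitoring script. It does nothing about the underlying risk that the dump itself can hang or kill mmfsd, which is why the sketch gives up after the first failure instead of retrying.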

Any thoughts/ideas here would be great.

Thanks!

-Aaron

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

