OK, after a bit of a delay due to a hectic travel week, here is some more 
information on my GPFS performance collection. At the bottom, I have links to 
my server and client zimon config files and a link to my presentation at SSUG 
Argonne in June. I didn't actually present it but included it in case there was 
interest.

I used to run a home-brew system of periodic calls to mmpmon to collect data, 
feeding the results into Kafka. This was a bit cumbersome, and when SS 4.2 
arrived I switched over to the built-in performance sensors (zimon) to collect 
the data. IBM has an "as-is" bridge between Grafana and the zimon collector 
that works reasonably well. They were supposed to release it, but it's been 
delayed; I will ask about it again and post more information if I get it.
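
For anyone curious what that home-brew setup looked like, here is a rough 
sketch of the idea in Python. The broker address, topic name, and polling 
interval are made up for illustration; the parsing assumes mmpmon's -p 
(parseable) output, where _key_ markers alternate with values:

import json
import subprocess
import time

from kafka import KafkaProducer  # kafka-python package

MMPMON = '/usr/lpp/mmfs/bin/mmpmon'

def parse_mmpmon(line):
    # mmpmon -p lines alternate _key_ markers and values after the
    # leading record tag, e.g.:
    # _fs_io_s_ _n_ 10.0.0.1 _nn_ node1 ... _br_ 12345 _bw_ 6789
    toks = line.split()
    rec = {}
    i = 1  # skip the record-type token (_fs_io_s_)
    while i + 1 < len(toks):
        rec[toks[i].strip('_')] = toks[i + 1]
        i += 2
    return rec

producer = KafkaProducer(
    bootstrap_servers='broker.example.com:9092',  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode())

while True:
    # fs_io_s reports per-filesystem I/O counters; -p gives parseable
    # output, -s suppresses the prompt.
    out = subprocess.run([MMPMON, '-p', '-s'], input='fs_io_s\n',
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith('_fs_io_s_'):
            producer.send('gpfs-perf', parse_mmpmon(line))  # hypothetical topic
    time.sleep(60)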

My biggest struggle with the zimon configuration is the collector's large 
memory requirement on large clusters (many clients, file systems, NSDs). I 
ended up deploying a federation of six collectors with 16 GB each for my 
larger clusters, and even then I have to limit the number of stats I collect 
and how long I retain them. IBM is aware of the memory issue and I believe 
they are looking at ways to reduce it.
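
To give a feel for why the memory adds up, here is a back-of-envelope 
calculation in Python. Every number below is my own rough assumption (node 
count, keys per node, per-sample cost), not an IBM figure:

# Rough sizing of a collector's in-memory metric store.
nodes = 4000                  # clients reporting in (assumed)
keys_per_node = 200           # metrics x per-fs/per-NSD breakdown (assumed)
bytes_per_sample = 16         # assumed cost of one stored sample
seconds_retained = 24 * 3600  # one day at 1-second resolution

ram = nodes * keys_per_node * bytes_per_sample * seconds_retained
print(f'{ram / 2**30:.0f} GiB')  # ~1030 GiB - far more than 6 x 16 GB,
                                 # hence federation plus limits on stats
                                 # and retention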

As for what specific metrics I tend to look at:

gpfs_fis_bytes_read (and _written) - aggregated file system read and write 
stats
gpfs_nsdpool_bytes_read (and _written) - aggregated pool stats, useful since I 
have data and metadata split into separate pools
gpfs_fs_tot_disk_wait_rd (and _wr) - NSD disk wait stats

These give me the best overall sense of how things are going. I have a bunch 
of other, more detailed dashboards for individual file systems and clients 
that let me drill down. The built-in SS GUI is pretty good for small clusters, 
and it's getting some improvements in 4.2.1 that might make me take a closer 
look at it again.
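
If you want to spot-check those same counters without Grafana, the sensor 
data also feeds the mmperfmon query CLI. A minimal sketch, with the caveat 
that the option spelling here is from the 4.2-era docs and should be verified 
against your release:

import subprocess

# Pull the last ten one-minute buckets of the aggregate read/write
# counters listed above (-b bucket size in seconds, -n bucket count).
metrics = 'gpfs_fis_bytes_read,gpfs_fis_bytes_written'
result = subprocess.run(
    ['/usr/lpp/mmfs/bin/mmperfmon', 'query', metrics, '-b', '60', '-n', '10'],
    capture_output=True, text=True, check=True)
print(result.stdout)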

I also look at the RPC waiters stats - not present in the 4.2.0 Grafana 
bridge, but I hear they are coming in 4.2.1.

My SSUG Argonne Presentation (I didn't talk due to time constraints): 
http://files.gpfsug.org/presentations/2016/anl-june/SSUG_Nuance_PerfTools.pdf

Zimon server config file: 
https://www.dropbox.com/s/gvtfhhqfpsknfnh/ZIMonSensors.cfg.server?dl=0
Zimon client config file: 
https://www.dropbox.com/s/k5i6rcnaco4vxu6/ZIMonSensors.cfg.client?dl=0


Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid


From: <[email protected]> on behalf of Brian Marshall 
<[email protected]>
Reply-To: gpfsug main discussion list <[email protected]>
Date: Wednesday, July 13, 2016 at 8:43 AM
To: "[email protected]" <[email protected]>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Aggregating filesystem performance 
(Oesterlin, Robert)

Robert,

1) Do you see any noticeable performance impact by running the performance 
monitoring?

2) Can you share the zimon configuration that you use? i.e. what metrics do you 
find most useful?

Thank you,
Brian Marshall
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
