In version 4.2.3 you can turn on QOS --fine-stats and --pid-stats and get I/O operation statistics for each active process on each node.
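For example (a minimal sketch; check the mmchqos/mmlsqos man pages linked below for the exact arguments; the file system name "fs1" and the 60-second value are placeholders, not taken from the original post):

   # enable QOS with fine-grained and per-process statistics
   mmchqos fs1 --enable --fine-stats 60 --pid-stats yes

   # later, list the collected fine-grained/per-process statistics
   mmlsqos fs1 --fine-stats 60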
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmchqos.htm
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmlsqos.htm

The statistics let you distinguish single-sector IOPS from partial-block multisector IOPS and full-block multisector IOPS. Note that to use this feature you must enable QOS, but by default you start running with all throttles set to "unlimited". There is some overhead, so you may want to enable it only when you need to find the "bad" processes.

It's a little tricky to use effectively, but we ship a sample script that shows some ways to produce, massage and filter the raw data: samples/charts/qosplotfine.pl. The data is available in CSV format, so it's easy to feed into spreadsheets or databases and crunch...

--marc of GPFS.

From: "Andreas Petzold (SCC)" <andreas.petz...@kit.edu>
To: <gpfsug-discuss@spectrumscale.org>
Date: 05/30/2017 08:17 AM
Subject: [gpfsug-discuss] Associating I/O operations with files/processes
Sent by: gpfsug-discuss-boun...@spectrumscale.org

Dear group,

first a quick introduction: at KIT we are running a 20+ PB storage system with several large (1-9 PB) file systems. We have a 14-node NSD server cluster and 5 small (~10 node) protocol node clusters, each of which mounts one of the file systems. The protocol nodes run server software (dCache, xrootd) specific to our users, who are primarily the LHC experiments at CERN. The GPFS version is 4.2.2 everywhere. All servers are connected via IB, while the protocol nodes communicate with their clients via Ethernet.

Now let me describe the problem we are facing. For a few days now, one of the protocol nodes has been showing a very strange and as yet unexplained I/O behaviour. Previously we were seeing reads like this (iohist example from a well-behaved node):

14:03:37.637526  R  data  32:138835918848  8192  46.626  cli  0A417D79:58E3B179  172.18.224.19
14:03:37.660177  R  data  18:12590325760   8192  25.498  cli  0A4179AD:58E3AE66  172.18.224.14
14:03:37.640660  R  data  15:106365067264  8192  45.682  cli  0A4179AD:58E3ADD7  172.18.224.14
14:03:37.657006  R  data  35:130482421760  8192  30.872  cli  0A417DAD:58E3B266  172.18.224.21
14:03:37.643908  R  data  33:107847139328  8192  45.571  cli  0A417DAD:58E3B206  172.18.224.21

Since a few days ago we see this on the problematic node:

14:06:27.253537  R  data  46:126258287872  8  15.474  cli  0A4179AB:58E3AE54  172.18.224.13
14:06:27.268626  R  data  40:137280768624  8   0.395  cli  0A4179AD:58E3ADE3  172.18.224.14
14:06:27.269056  R  data  46:56452781528   8   0.427  cli  0A4179AB:58E3AE54  172.18.224.13
14:06:27.269417  R  data  47:97273159640   8   0.293  cli  0A4179AD:58E3AE5A  172.18.224.14
14:06:27.269293  R  data  49:59102786168   8   0.425  cli  0A4179AD:58E3AE72  172.18.224.14
14:06:27.269531  R  data  46:142387326944  8   0.340  cli  0A4179AB:58E3AE54  172.18.224.13
14:06:27.269377  R  data  28:102988517096  8   0.554  cli  0A417879:58E3AD08  172.18.224.10

The number of read ops has gone up by O(1000), which is what one would expect when going from 8192-sector reads to 8-sector reads (8192/8 = 1024). We have already ruled out problems with the node itself, so we are focusing on the applications running on the node. What we'd like to do is associate the I/O requests either with files or with specific processes running on the machine, in order to be able to blame the correct application. Can somebody tell us if this is possible and, if not, whether there are other ways to understand which application is causing this?
Thanks,
Andreas

--
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)

Andreas Petzold

Hermann-von-Helmholtz-Platz 1, Building 449, Room 202
D-76344 Eggenstein-Leopoldshafen

Tel: +49 721 608 24916
Fax: +49 721 608 24972
Email: petz...@kit.edu
www.scc.kit.edu

KIT – The Research University in the Helmholtz Association

Since 2010, KIT has been certified as a family-friendly university.
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss