Re: [gpfsug-discuss] Running the Spectrum Scale on a Compute-only Cluster ?
Hi,

>> The Spectrum Scale GUI is widely run on GPFS clusters that include their own
>> storage, but what about multi-cluster with separate Storage and Compute
>> clusters?

Yes, the Spectrum Scale GUI works on a GPFS multi-cluster setup with separate Storage and Compute clusters. The GUI works fine on a compute-only cluster as well as on a storage-only cluster.

>> And if so does it simply omit components that are not present - such as
>> Recovery Groups and NSD Servers ?

Correct. The GUI panels on the compute cluster will still show the remotely mounted file systems and filesets, and show their health status.

Cheers,
-Kums

Kumaran Rajaram

From: gpfsug-discuss-boun...@spectrumscale.org On Behalf Of Kidger, Daniel
Sent: Friday, March 11, 2022 12:34 PM
To: gpfsug-discuss@spectrumscale.org
Subject: [gpfsug-discuss] Running the Spectrum Scale on a Compute-only Cluster ?

The Spectrum Scale GUI is widely run on GPFS clusters that include their own storage, but what about multi-cluster with separate Storage and Compute clusters? Will the GUI run on a compute-only cluster? And if so does it simply omit components that are not present - such as Recovery Groups and NSD Servers ?

Daniel

Daniel Kidger
HPC Storage Solutions Architect, EMEA
daniel.kid...@hpe.com
+44 (0)7818 522266
hpe.com
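For anyone setting this up, the commands below give a quick view of what a compute-only cluster (and therefore its GUI) can see of the remote storage cluster. This is only a sketch to be run on the compute cluster; cluster and file-system names on your system will differ:

# Remote clusters this cluster is authorized to access
mmremotecluster show all
# Remote file systems defined for mounting on this cluster
mmremotefs show all
# Which nodes currently mount the remote file systems
mmlsmount all_remote -L
# Overall component health (no Recovery Group / NSD server components on a compute-only cluster)
mmhealth cluster show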
Re: [gpfsug-discuss] IO sizes
Hi Uwe,

>> But what puzzles me even more: one of the servers compiles IOs even smaller,
>> varying between 3.2MiB and 3.6MiB mostly - both for reads and writes ... I
>> just cannot see why.

IMHO, if GPFS on this particular NSD server was restarted often during the setup, then it is possible that the GPFS pagepool is not contiguous. As a result, a GPFS 8MiB buffer in the pagepool might be a scatter-gather (SG) list with many small entries (in memory), resulting in smaller I/Os when these buffers are issued to the disks. The fix would be to reboot the server and start GPFS so that the pagepool is contiguous, resulting in the 8MiB buffer being comprised of one (or few) SG entries.

>> In the current situation (i.e. with IOs a bit larger than 4MiB) setting
>> max_sectors_kB to 4096 might do the trick, but as I do not know the cause for
>> that behaviour it might well start to issue IOs smaller than 4MiB again at
>> some point, so that is not a nice solution.

It is advised not to restart GPFS often on the NSD servers (in production) to keep the pagepool contiguous. Ensure that there is enough free memory on the NSD server and do not run memory-intensive jobs, so that the pagepool is not impacted (e.g. swapped out). Also, enable GPFS numaMemoryInterleave=yes and verify that the pagepool is equally distributed across the NUMA domains for good performance. numaMemoryInterleave=yes requires that the numactl packages are installed and that GPFS is then restarted.

# mmfsadm dump config | egrep "numaMemory|pagepool "
! numaMemoryInterleave yes
! pagepool 282394099712

# pgrep mmfsd | xargs numastat -p

Per-node process memory usage (in MBs) for PID 2120821 (mmfsd)
                    Node 0        Node 1        Total
                 ---------     ---------    ---------
Huge                  0.00          0.00         0.00
Heap                  1.26          3.26         4.52
Stack                 0.01          0.01         0.02
Private          137710.43     137709.96    275420.39
                 ---------     ---------    ---------
Total            137711.70     137713.23    275424.92

My two cents,
-Kums

Kumaran Rajaram

From: gpfsug-discuss-boun...@spectrumscale.org On Behalf Of Uwe Falke
Sent: Wednesday, February 23, 2022 8:04 PM
To: gpfsug-discuss@spectrumscale.org
Subject: Re: [gpfsug-discuss] IO sizes

Hi,

the test bench is gpfsperf running on up to 12 clients with 1...64 threads doing sequential reads and writes, file size per gpfsperf process is 12TB (with 6TB I saw caching effects in particular for large thread numbers ...)

As I wrote initially: GPFS is issuing nothing but 8MiB IOs to the data disks, as expected in that case.

Interesting thing though: I have rebooted the suspicious node. Now, it does not issue smaller IOs than the others, but -- unbelievable -- larger ones (up to about 4.7MiB). This is still harmful as also that size is incompatible with full stripe writes on the storage (8+2 disk groups, i.e. logically RAID6).

Currently, I draw this information from the storage boxes; I have not yet checked iostat data for that benchmark test after the reboot (before, when IO sizes were smaller, we saw that both in iostat and in the perf data retrieved from the storage controllers).

And: we have a separate data pool, hence dataOnly NSDs, I am just talking about these ...

As for "Are you sure that Linux OS is configured the same on all 4 NSD servers?" - of course there are not two boxes identical in the world. I have actually not installed those machines, and, yes, I also considered reinstalling them (or at least the disturbing one). However, I do not have reason to assume or expect a difference; the supplier has just implemented these systems recently from scratch.

In the current situation (i.e.
with IOs bit larger than 4MiB) setting max_sectors_kB to 4096 might do the trick, but as I do not know the cause for that behaviour it might well start to issue IOs smaller than 4MiB again at some point, so that is not a nice solution. Thanks Uwe On 23.02.22 22:20, Andrew Beattie wrote: Alex, Metadata will be 4Kib Depending on the filesystem version you will also have subblocks to consider V4 filesystems have 1/32 subblocks, V5 filesystems have 1/1024 subblocks (assuming metadata and data block size is the same) My first question would be is “ Are you sure that Linux OS is configured the same on all 4 NSD servers?. My second question would be do you know what your average file size is if most of your files are smaller than your filesystem block size, then you are always going to be performing writes using groups of subblocks rather than a full block writes. Regards, Andrew On 24 Feb 2022, at 04:39, Alex Chekholko <mailto:a...@calicolabs.com> wrote:
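For reference, a quick way to compare what the Linux block layer will accept per request with what is actually being issued on a given NSD server is sketched below. The device name sdb is only an illustration; map your dataOnly NSDs to their local devices first (e.g. with mmlsnsd -m):

# Per-request limit currently in effect, and the hardware limit (both in KiB)
cat /sys/block/sdb/queue/max_sectors_kb
cat /sys/block/sdb/queue/max_hw_sectors_kb
# Watch the average request size actually issued to the device
iostat -xm 5 sdb
# (the column is areq-sz in KiB on newer sysstat, avgrq-sz in 512-byte sectors on older)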
Re: [gpfsug-discuss] du --apparent-size and quota
Hi,

>> If I'm not mistaken even with SS5 created filesystems, 1 MiB FS block size
>> implies 32 kiB sub blocks (32 sub-blocks).

Just to add: the /srcfilesys seems to have been created with GPFS version 4.x, which supports only 32 sub-blocks per block.

-T /srcfilesys                          Default mount point
-V 16.00 (4.2.2.0)                      Current file system version
   14.10 (4.1.0.4)                      Original file system version
--create-time Tue Feb 3 11:46:10 2015   File system creation time
-B 1048576                              Block size
-f 32768                                Minimum fragment (subblock) size in bytes
--subblocks-per-full-block 32           Number of subblocks per full block

The /dstfilesys was created with GPFS version 5.x, which supports more than 32 subblocks per block. /dstfilesys has 512 subblocks-per-full-block with an 8KiB subblock size, since the file-system block size is 4MiB.

-T /dstfilesys                          Default mount point
-V 23.00 (5.0.5.0)                      File system version
--create-time Tue May 11 16:51:27 2021  File system creation time
-B 4194304                              Block size
-f 8192                                 Minimum fragment (subblock) size in bytes
--subblocks-per-full-block 512          Number of subblocks per full block

Hope this helps,
-Kums

-----Original Message-----
From: gpfsug-discuss-boun...@spectrumscale.org On Behalf Of Loic Tortay
Sent: Tuesday, June 1, 2021 10:57 AM
To: gpfsug main discussion list ; Ulrich Sibiller ; gpfsug-disc...@gpfsug.org
Subject: Re: [gpfsug-discuss] du --apparent-size and quota

On 6/1/21 4:26 PM, Ulrich Sibiller wrote:
[...]
> While trying to understand what's going on here I found this on the
> source file system (which is valid for all files, with different
> numbers of course):
>
> $ du --block-size 1 /srcfilesys/fileset/filename
> 65536 /srcfilesys/fileset/filename
>
> $ du --apparent-size --block-size 1 /srcfilesys/fileset/filename
> 3994 /srcfilesys/fileset/filename
>
> $ stat /srcfilesys/fileset/filename
>   File: ‘/srcfilesys/fileset/filename’
>   Size: 3994     Blocks: 128     IO Block: 1048576  regular file
> Device: 2ah/42d  Inode: 23266095  Links: 1
> Access: (0660/-rw-rw----)  Uid: (73018/ cpunnoo)  Gid: (50070/ dc-rti)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2021-05-12 20:10:13.814459305 +0200
> Modify: 2020-07-16 11:08:41.631006000 +0200
> Change: 2020-07-16 11:08:41.630896177 +0200
>  Birth: -

Hello,
This looks like the sub-block overhead.

If I'm not mistaken, even with SS5-created filesystems, a 1 MiB FS block size implies 32 kiB sub-blocks (32 sub-blocks).

The sub-block is the minimum disk allocation for files (if the file content is too large to be kept in the inode, when that is supported on the specific GPFS filesystem).

The "Blocks" value displayed by "stat" is in 512-byte units, so 128*512 = 65536 (which is consistent with "du"): two 32 kiB sub-blocks due to data replication.

The "--apparent-size" option to "du" uses the user-visible size, not the actual disk usage (per the man page), so 3994 is also consistent with the "stat" output.

AFAIK, GPFS space quotas count the sub-blocks, not the apparent sizes, so again this would be consistent with the overhead.

Beside the overhead, hard-links in the source FS (which, if I'm not mistaken, are not handled by "rsync" unless you specify "-H") and in some cases sparse files can also explain the differences.

Loïc.
--
| Loïc Tortay - IN2P3 Computing Centre |
Re: [gpfsug-discuss] Spectrum Scale - how to get RPO=0
Hi Tom, >>we are trying to implement a mixed linux/windows environment and we have one >>thing at the top - is there any global method to avoid asynchronous I/O and >>write everything in >>synchronous mode? If the local and remote sites have good inter-site network bandwidth and low-latency, then you may consider using GPFS synchronous replication at the file-system level (-m 2 -r 2). The Spectrum Scale documentation (link below) has further details. https://www.ibm.com/docs/en/spectrum-scale/5.1.0?topic=data-synchronous-mirroring-gpfs-replication Regards, -Kums From: gpfsug-discuss-boun...@spectrumscale.org On Behalf Of Tomasz Rachobinski Sent: Monday, May 24, 2021 9:06 AM To: gpfsug-discuss@spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale - how to get RPO=0 Hello everyone, we are trying to implement a mixed linux/windows environment and we have one thing at the top - is there any global method to avoid asynchronous I/O and write everything in synchronous mode? Another thing is - if there is no global sync setting, how to enforce sync i/o from linux/windows client? Greetings Tom Rachobinski ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
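For reference, a minimal sketch of what the -m 2 -r 2 setup from the link above looks like in practice. All names, stanza values and failure-group numbers below are illustrative:

# NSD stanzas: NSDs at each site get a different failure group
%nsd: nsd=site1_nsd1 servers=siteA-nsd01 usage=dataAndMetadata failureGroup=1
%nsd: nsd=site2_nsd1 servers=siteB-nsd01 usage=dataAndMetadata failureGroup=2

# New file system with two copies of both data and metadata
mmcrfs syncfs -F twosite_nsds.stanza -m 2 -r 2 -T /gpfs/syncfs

# Existing file system (provided it was created with max replicas of at least 2):
mmchfs syncfs -m 2 -r 2
mmrestripefs syncfs -R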
Re: [gpfsug-discuss] Client Latency and High NSD Server Load Average
Hi, >> I do notice nsd03/nsd04 have long waiters, but nsd01 doesn't (nsd02-ib is offline for now): Please issue "mmlsdisk -m" in NSD client to ascertain the active NSD server serving a NSD. Since nsd02-ib is offlined, it is possible that some servers would be serving higher NSDs than the rest. https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_PoorPerformanceDuetoDiskFailure.htm https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_HealthStateOfNSDserver.htm >> From the waiters you provided I would guess there is something amiss with some of your storage systems. Please ensure there are no "disk rebuild" pertaining to certain NSDs/storage volumes in progress (in the storage subsystem) as this can sometimes impact block-level performance and thus impact latency, especially for write operations. Please ensure that the hardware components constituting the Spectrum Scale stack are healthy and performing optimally. https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_pspduetosyslevelcompissue.htm Please refer to the Spectrum Scale documentation (link below) for potential causes (e.g. Scale maintenance operation such as mmapplypolicy/mmestripefs in progress, slow disks) that can be contributing to this issue: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_performanceissues.htm Thanks and Regards, -Kums Kumaran Rajaram Spectrum Scale Development, IBM Systems k...@us.ibm.com From: "Frederick Stock" To: gpfsug-discuss@spectrumscale.org Cc: gpfsug-discuss@spectrumscale.org Date: 06/04/2020 07:08 AM Subject:[EXTERNAL] Re: [gpfsug-discuss] Client Latency and High NSD Server Load Average Sent by:gpfsug-discuss-boun...@spectrumscale.org >From the waiters you provided I would guess there is something amiss with some of your storage systems. Since those waiters are on NSD servers they are waiting for IO requests to the kernel to complete. Generally IOs are expected to complete in milliseconds, not seconds. You could look at the output of "mmfsadm dump nsd" to see how the GPFS IO queues are working but that would be secondary to checking your storage systems. Fred __ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 sto...@us.ibm.com - Original message - From: "Saula, Oluwasijibomi" Sent by: gpfsug-discuss-boun...@spectrumscale.org To: "gpfsug-discuss@spectrumscale.org" Cc: Subject: [EXTERNAL] Re: [gpfsug-discuss] Client Latency and High NSD Server Load Average Date: Wed, Jun 3, 2020 6:24 PM Frederick, Yes on both counts! - mmdf is showing pretty uniform (ie 5 NSDs out of 30 report 65% free; All others are uniform at 58% free)... NSD servers per disks are called in round-robin fashion as well, for example: gpfs1 tier2_001nsd02-ib,nsd03-ib,nsd04-ib,tsm01-ib,nsd01-ib gpfs1 tier2_002nsd03-ib,nsd04-ib,tsm01-ib,nsd01-ib,nsd02-ib gpfs1 tier2_003nsd04-ib,tsm01-ib,nsd01-ib,nsd02-ib,nsd03-ib gpfs1 tier2_004tsm01-ib,nsd01-ib,nsd02-ib,nsd03-ib,nsd04-ib Any other potential culprits to investigate? 
I do notice nsd03/nsd04 have long waiters, but nsd01 doesn't (nsd02-ib is offline for now):

[nsd03-ib ~]# mmdiag --waiters
=== mmdiag: waiters ===
Waiting 6.5113 sec since 17:17:33, monitored, thread 4175 NSDThread: for I/O completion
Waiting 6.3810 sec since 17:17:33, monitored, thread 4127 NSDThread: for I/O completion
Waiting 6.1959 sec since 17:17:34, monitored, thread 4144 NSDThread: for I/O completion

nsd04-ib:
Waiting 13.1386 sec since 17:19:09, monitored, thread 9971 NSDThread: for I/O completion
Waiting 10.3562 sec since 17:19:12, monitored, thread 9958 NSDThread: for I/O completion
Waiting 10.0338 sec since 17:19:12, monitored, thread 9951 NSDThread: for I/O completion

tsm01-ib:
Waiting 8.1211 sec since 17:20:24, monitored, thread 3644 NSDThread: for I/O completion
Waiting 7.6690 sec since 17:20:24, monitored, thread 3641 NSDThread: for I/O completion
Waiting 7.4969 sec since 17:20:24, monitored, thread 3658 NSDThread: for I/O completion
Waiting 7.3573 sec since 17:20:24, monitored, thread 3642 NSDThread: for I/O completion

nsd01-ib:
Waiting 0.2548 sec since 17:21:47, monitored, thread 30513 NSDThread: for I/O completion
Waiting 0.1502 sec since 17:21:47, monitored, thread 30529 NSDThread: for I/O completion

Thanks,

Oluwasijibomi (Siji) Saula
HPC Systems Administrator / Information Technology
Research 2 Building 220B / Fargo ND 58108-6050
p: 701.231.7749 / www.ndsu.edu

From: gpfsug-discuss-boun...@spectrumscale.org on behalf of gpfsug-discuss-requ...@spectrumscale.org
Sent: Wednesday, June 3, 2
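For reference, a small sketch that combines the suggestions above: confirm which server currently serves each NSD, then sample the waiters on all NSD servers in one pass (node and file-system names are the ones from this thread):

# Active NSD server per disk, as seen from an NSD client
mmlsdisk gpfs1 -m
# Sample the waiters on every NSD server
for n in nsd01-ib nsd03-ib nsd04-ib tsm01-ib; do
    echo "=== $n ==="
    ssh $n /usr/lpp/mmfs/bin/mmdiag --waiters
done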
Re: [gpfsug-discuss] How to prove that data is in inode
Hi, >> How can I prove that data of a small file is stored in the inode (and not on a data nsd)? You may use echo "inode file_inode_number" | tsdbfs fs_device | grep indirectionLevel and if it points to INODE, then the file is stored in the inodes # 4K Inode Size # mmlsfs gpfs3a | grep 'Inode size' -i 4096 Inode size in bytes # Small file # ls -l /mnt/gpfs3a/hello.txt -rw-r--r-- 1 root root 6 Jul 17 08:32 /mnt/gpfs3a/hello.txt # ls -i /mnt/gpfs3a/hello.txt 91649 /mnt/gpfs3a/hello.txt #File is inlined within Inode # echo "inode 91649" | tsdbfs gpfs3a | grep indirectionLevel indirectionLevel=INODE status=USERFILE Regards, -Kums From: "Billich Heinrich Rainer (ID SD)" To: gpfsug main discussion list Date: 07/17/2019 07:49 AM Subject:[EXTERNAL] [gpfsug-discuss] How to prove that data is in inode Sent by:gpfsug-discuss-boun...@spectrumscale.org Hello, How can I prove that data of a small file is stored in the inode (and not on a data nsd)? We have a filesystem with 4k inodes on Scale 5.0.2 , but it seems there is no file data in the inodes? I would expect that 'stat' reports 'Blocks: 0' for a small file, but I see 'Blocks:1'. Cheers, Heiner I tried []# rm -f test; echo hello > test []# ls -ls test 1 -rw-r--r-- 1 root root 6 Jul 17 13:11 test [root@testnas13ems01 test]# stat test File: ‘test’ Size: 6Blocks: 1 IO Block: 1048576 regular file Device: 2dh/45d Inode: 353314 Links: 1 Access: (0644/-rw-r--r--) Uid: (0/root) Gid: (0/root) Access: 2019-07-17 13:11:03.037049000 +0200 Modify: 2019-07-17 13:11:03.037331000 +0200 Change: 2019-07-17 13:11:03.037259319 +0200 Birth: - [root@testnas13ems01 test]# du test 1test [root@testnas13ems01 test]# du -b test 6test [root@testnas13ems01 test]# Filesystem # mmlsfs f flagvaluedescription --- --- -f 32768Minimum fragment (subblock) size in bytes -i 4096 Inode size in bytes -I 32768Indirect block size in bytes -m 1Default number of metadata replicas -M 2Maximum number of metadata replicas -r 1Default number of data replicas -R 2Maximum number of data replicas -j cluster Block allocation type -D nfs4 File locking semantics in effect -k nfs4 ACL semantics in effect -n 32 Estimated number of nodes that will mount file system -B 1048576 Block size -Q user;group;fileset Quotas accounting enabled user;group;fileset Quotas enforced user;group;fileset Default quotas enabled --perfileset-quota Yes Per-fileset quota enforcement --filesetdfYes Fileset df enabled? -V 20.01 (5.0.2.0) Current file system version 15.01 (4.2.0.0) Original file system version --create-time * 2017 File system creation time -z No Is DMAPI enabled? -L 33554432 Logfile size -E Yes Exact mtime mount option -S relatime Suppress atime mount option -K whenpossible Strict replica allocation option --fastea Yes Fast external attributes enabled? --encryption No Encryption enabled? --inode-limit 1294592 Maximum number of inodes in all inode spaces --log-replicas 0Number of log replicas --is4KAligned Yes is4KAligned? --rapid-repair Yes rapidRepair enabled? --write-cache-threshold 0 HAWC Threshold (max 65536) --subblocks-per-full-block 32 Number of subblocks per full block -P system;data Disk storage pools in file system --file-audit-log No File Audit Logging enabled? --maintenance-mode No Maintenance Mode enabled? -d ** -A yes Automatic mount option -o nfssync,nodevAdditional mount options -T / Default mount point --mount-priority 0Mount priority -- === Heinrich
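For convenience, the check above can be wrapped into a couple of commands for an arbitrary file. This is only a sketch; the file path and file-system device name are the ones from the example, and tsdbfs should be used read-only and with care on production systems:

FILE=/mnt/gpfs3a/hello.txt
FSDEV=gpfs3a
INODE=$(stat -c %i "$FILE")
echo "inode $INODE" | tsdbfs $FSDEV | grep indirectionLevel
# indirectionLevel=INODE  -> the file data is stored in the inode
# any other indirection level -> the file data occupies data blocks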
Re: [gpfsug-discuss] verbs status not working in 5.0.2
Hi,

This issue is resolved in the latest 5.0.3.1 release.

# mmfsadm dump version | grep Build
Build branch "5.0.3.1 ".

# mmfsadm test verbs status
VERBS RDMA status: started

Regards,
-Kums

From: Ryan Novosielski
To: "gpfsug-discuss@spectrumscale.org"
Date: 06/11/2019 03:46 PM
Subject: [EXTERNAL] Re: [gpfsug-discuss] verbs status not working in 5.0.2
Sent by: gpfsug-discuss-boun...@spectrumscale.org

Thanks -- this was originally how Lenovo told us to check this, and I came across `mmfsadm test verbs status` on my own. I'm thinking, though, isn't there some risk that if RDMA went down somehow, that wouldn't be caught by your script? I can't say that I normally see that as the failure mode (it's most often booting up without), nor do I know what happens to `mmfsadm test verbs status` if you pull a cable or something.

On 6/11/19 3:37 PM, Bryan Banister wrote:
> This has been broken for a long time... we too were checking that
> `mmfsadm test verbs status` reported that RDMA is working. We
> don't want nodes that are not using RDMA running in the cluster.
>
> We have decided to just look for the log entry like this:
> test_gpfs_rdma_active() { [[ "$(grep -c "VERBS RDMA started" /var/adm/ras/mmfs.log.latest)" == "1" ]] }
>
> Hope that helps, -Bryan

--
|| \\UTGERS,   |---------------------------*O*---------------------------
||_// the State |  Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
||  \\   of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
     `'
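For reference, a small sketch that combines both checks discussed in this thread into a single node-local probe (paths and log strings are the ones shown above; adjust for your release):

#!/bin/bash
# Probe VERBS RDMA state via both the live status command and the startup log entry.
status=$(/usr/lpp/mmfs/bin/mmfsadm test verbs status 2>/dev/null)
started=$(grep -c "VERBS RDMA started" /var/adm/ras/mmfs.log.latest)
if echo "$status" | grep -q "started" && [ "$started" -ge 1 ]; then
    echo "OK: VERBS RDMA active"
    exit 0
else
    echo "CRITICAL: VERBS RDMA not active"
    exit 2
fi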
Re: [gpfsug-discuss] NSD network checksums (nsdCksumTraditional)
In non-GNR setup, nsdCksumTraditional=yes enables data-integrity checking between a traditional NSD client node and its NSD server, at the network level only. The ESS storage supports end-to-end checksum, NSD client to the ESS IO servers (at the network level) as well as from ESS IO servers to the disk/storage. This is further detailed in the docs (link below): https://www.ibm.com/support/knowledgecenter/en/SSYSP8_5.3.1/com.ibm.spectrum.scale.raid.v5r01.adm.doc/bl1adv_introe2echecksum.htm Best, -Kums From: Stephen Ulmer To: gpfsug main discussion list Date: 10/29/2018 04:52 PM Subject:Re: [gpfsug-discuss] NSD network checksums (nsdCksumTraditional) Sent by:gpfsug-discuss-boun...@spectrumscale.org So the ESS checksums that are highly touted as "protecting all the way to the disk surface" completely ignore the transfer between the client and the NSD server? It sounds like you are saying that all of the checksumming done for GNR is internal to GNR and only protects against bit-flips on the disk (and in staging buffers, etc.) I’m asking because your explanation completely ignores calculating anything on the NSD client and implies that the client could not participate, given that it does not know about the structure of the vdisks under the NSD — but that has to be a performance factor for both types if the transfer is protected starting at the client — which it is in the case of nsdCksumTraditional which is what we are comparing to ESS checksumming. If ESS checksumming doesn’t protect on the wire I’d say that marketing has run amok, because that has *definitely* been implied in meetings for which I’ve been present. In fact, when asked if Spectrum Scale provides checksumming for data in-flight, IBM sales has used it as an ESS up-sell opportunity. -- Stephen On Oct 29, 2018, at 3:56 PM, Kumaran Rajaram wrote: Hi, >>How can it be that the I/O performance degradation warning only seems to accompany the nsdCksumTraditional setting and not GNR? >>Why is there such a penalty for "traditional" environments? In GNR IO/NSD servers (ESS IO nodes), the checksums are computed in parallel for a NSD (storage volume/vdisk) across the threads handling each pdisk/drive (that constitutes the vdisk/volume). This is possible since the GNR software on the ESS IO servers is tightly integrated with underlying storage and is aware of the vdisk DRAID configuration (strip-size, pdisk constituting vdisk etc.) to perform parallel checksum operations. In non-GNR + external storage model, the GPFS software on the NSD server(s) does not manage the underlying storage volume (this is done by storage RAID controllers) and the checksum is computed serially. This would contribute to increase in CPU usage and I/O performance degradation (depending on I/O access patterns, I/O load etc). My two cents. Regards, -Kums From:Aaron Knister To:gpfsug main discussion list Date:10/29/2018 12:34 PM Subject:[gpfsug-discuss] NSD network checksums (nsdCksumTraditional) Sent by:gpfsug-discuss-boun...@spectrumscale.org Flipping through the slides from the recent SSUG meeting I noticed that in 5.0.2 one of the features mentioned was the nsdCksumTraditional flag. Reading up on it it seems as though it comes with a warning about significant I/O performance degradation and increase in CPU usage. I also recall that data integrity checking is performed by default with GNR. How can it be that the I/O performance degradation warning only seems to accompany the nsdCksumTraditional setting and not GNR? 
As someone who knows exactly 0 of the implementation details, I'm just naively assuming that the checksum are being generated (in the same way?) in both cases and transferred to the NSD server. Why is there such a penalty for "traditional" environments? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
Re: [gpfsug-discuss] NSD network checksums (nsdCksumTraditional)
Hi, >>How can it be that the I/O performance degradation warning only seems to accompany the nsdCksumTraditional setting and not GNR? >>Why is there such a penalty for "traditional" environments? In GNR IO/NSD servers (ESS IO nodes), the checksums are computed in parallel for a NSD (storage volume/vdisk) across the threads handling each pdisk/drive (that constitutes the vdisk/volume). This is possible since the GNR software on the ESS IO servers is tightly integrated with underlying storage and is aware of the vdisk DRAID configuration (strip-size, pdisk constituting vdisk etc.) to perform parallel checksum operations. In non-GNR + external storage model, the GPFS software on the NSD server(s) does not manage the underlying storage volume (this is done by storage RAID controllers) and the checksum is computed serially. This would contribute to increase in CPU usage and I/O performance degradation (depending on I/O access patterns, I/O load etc). My two cents. Regards, -Kums From: Aaron Knister To: gpfsug main discussion list Date: 10/29/2018 12:34 PM Subject:[gpfsug-discuss] NSD network checksums (nsdCksumTraditional) Sent by:gpfsug-discuss-boun...@spectrumscale.org Flipping through the slides from the recent SSUG meeting I noticed that in 5.0.2 one of the features mentioned was the nsdCksumTraditional flag. Reading up on it it seems as though it comes with a warning about significant I/O performance degradation and increase in CPU usage. I also recall that data integrity checking is performed by default with GNR. How can it be that the I/O performance degradation warning only seems to accompany the nsdCksumTraditional setting and not GNR? As someone who knows exactly 0 of the implementation details, I'm just naively assuming that the checksum are being generated (in the same way?) in both cases and transferred to the NSD server. Why is there such a penalty for "traditional" environments? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
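For reference, a sketch of enabling and verifying the setting discussed in this thread on the NSD client nodes of a non-GNR cluster. The node-class name is illustrative, and the documented CPU/performance cost should be weighed before enabling it in production:

# Enable network-level checksums between traditional NSD clients and servers
mmchconfig nsdCksumTraditional=yes -N nsdClientNodes
# Verify the setting on a node
mmdiag --config | grep -i nsdCksumTraditional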
Re: [gpfsug-discuss] Tuning: single client, single thread, small files - native Scale vs NFS
Hi Alexander, 1. >>When writing to GPFS directly I'm able to write ~1800 files / second in a test setup. >>This is roughly the same on the protocol nodes (NSD client), as well as on the ESS IO nodes (NSD server). 2. >> When writing to the NFS export on the protocol node itself (to avoid any network effects) I'm only able to write ~230 files / second. IMHO #2, writing to the NFS export on the protocol node should be same as #1. Protocol node is also a NSD client and when you write from a protocol node, it will use the NSD protocol to write to the ESS IO nodes. In #1, you cite seeing ~1800 files from protocol node and in #2 you cite seeing ~230 file/sec which seem to contradict each other. >>Writing to the NFS export from another node (now including network latency) gives me ~220 files / second. IMHO, this workload "single client, single thread, small files, single directory - tar xf" is synchronous is nature and will result in single outstanding file to be sent from the NFS client to the CES node. Hence, the performance will be limited by network latency/capability between the NFS client and CES node for small IO size (~5KB file size). Also, what is the network interconnect/interface between the NFS client and CES node? Is the network 10GigE since @220 file/s for 5KiB file-size will saturate 1 x 10GigE link. 220 files/sec * 5KiB file size ==> ~1.126 GB/s. >> I'm aware that 'the real thing' would be to work with larger files in a multithreaded manner from multiple nodes - and that this scenario will scale quite well. Yes, larger file-size + multiple threads + multiple NFS client nodes will help to scale performance further by having more NFS I/O requests scheduled/pipelined over the network and processed on the CES nodes. >> I just want to ensure that I'm not missing something obvious over reiterating that massage to customers. Adding NFS experts/team, for advise. My two cents. Best Regards, -Kums From: "Alexander Saupp" To: gpfsug-discuss@spectrumscale.org Date: 10/15/2018 02:20 PM Subject:[gpfsug-discuss] Tuning: single client, single thread, small files - native Scale vs NFS Sent by:gpfsug-discuss-boun...@spectrumscale.org Dear Spectrum Scale mailing list, I'm part of IBM Lab Services - currently i'm having multiple customers asking me for optimization of a similar workloads. The task is to tune a Spectrum Scale system (comprising ESS and CES protocol nodes) for the following workload: A single Linux NFS client mounts an NFS export, extracts a flat tar archive with lots of ~5KB files. I'm measuring the speed at which those 5KB files are written (`time tar xf archive.tar`). I do understand that Spectrum Scale is not designed for such workload (single client, single thread, small files, single directory), and that such benchmark in not appropriate to benmark the system. Yet I find myself explaining the performance for such scenario (git clone..) quite frequently, as customers insist that optimization of that scenario would impact individual users as it shows task duration. I want to make sure that I have optimized the system as much as possible for the given workload, and that I have not overlooked something obvious. When writing to GPFS directly I'm able to write ~1800 files / second in a test setup. This is roughly the same on the protocol nodes (NSD client), as well as on the ESS IO nodes (NSD server). When writing to the NFS export on the protocol node itself (to avoid any network effects) I'm only able to write ~230 files / second. 
Writing to the NFS export from another node (now including network latency) gives me ~220 files / second. There seems to be a huge performance degradation by adding NFS-Ganesha to the software stack alone. I wonder what can be done to minimize the impact. - Ganesha doesn't seem to support 'async' or 'no_wdelay' options... anything equivalent available? - Is there and expected advantage of using the network-latency tuned profile, as opposed to the ESS default throughput-performance profile? - Are there other relevant Kernel params? - Is there an expected advantage of raising the number of threads (NSD server (nsd*WorkerThreads) / NSD client (workerThreads) / Ganesha (NB_WORKER)) for the given workload (single client, single thread, small files)? - Are there other relevant GPFS params? - Impact of Sync replication, disk latency, etc is understood. - I'm aware that 'the real thing' would be to work with larger files in a multithreaded manner from multiple nodes - and that this scenario will scale quite well. I just want to ensure that I'm not missing something obvious over reiterating that massage to customers. Any help was greatly appreciated - thanks much in advance! Alexander Saupp IBM Germany Mit freundlichen Grüßen / Kind regards Alexander Saupp IBM Systems, Storage Platform, EMEA Storage Competence Center Phone: +49 7034-643-1512 IBM Deutschland GmbH
Re: [gpfsug-discuss] What NSDs does a file have blocks on?
Hi Kevin, >>I want to know what NSDs a single file has its’ blocks on? You may use /usr/lpp/mmfs/samples/fpo/mmgetlocationto obtain the file-to-NSD block layout map. Use the -h option for this tools usage ( mmgetlocation -h). Sample output is below: # File-system block size is 4MiB and sample file is 40MiB. # ls -lh /mnt/gpfs3a/data_out/lf -rw-r--r-- 1 root root 40M Jul 9 16:42 /mnt/gpfs3a/data_out/lf # du -sh /mnt/gpfs3a/data_out/lf 40M /mnt/gpfs3a/data_out/lf # mmlsfs gpfs3a | grep 'Block size' -B 4194304 Block size # The file data is striped across 10 x NSDs (DMD_NSDX) constituting the file-system # /usr/lpp/mmfs/samples/fpo/mmgetlocation -f /mnt/gpfs3a/data_out/lf [FILE /mnt/gpfs3a/data_out/lf INFORMATION] FS_DATA_BLOCKSIZE : 4194304 (bytes) FS_META_DATA_BLOCKSIZE : 4194304 (bytes) FS_FILE_DATAREPLICA : 1 FS_FILE_METADATAREPLICA : 1 FS_FILE_STORAGEPOOLNAME : system FS_FILE_ALLOWWRITEAFFINITY : no FS_FILE_WRITEAFFINITYDEPTH : 0 FS_FILE_BLOCKGROUPFACTOR : 1 chunk(s)# 0 (offset 0) : [DMD_NSD5 c72f1m5u37ib0,c72f1m5u39ib0] chunk(s)# 1 (offset 4194304) : [DMD_NSD6 c72f1m5u39ib0,c72f1m5u37ib0] chunk(s)# 2 (offset 8388608) : [DMD_NSD7 c72f1m5u37ib0,c72f1m5u39ib0] chunk(s)# 3 (offset 12582912) : [DMD_NSD8 c72f1m5u39ib0,c72f1m5u37ib0] chunk(s)# 4 (offset 16777216) : [DMD_NSD9 c72f1m5u37ib0,c72f1m5u39ib0] chunk(s)# 5 (offset 20971520) : [DMD_NSD10 c72f1m5u39ib0,c72f1m5u37ib0] chunk(s)# 6 (offset 25165824) : [DMD_NSD1 c72f1m5u37ib0,c72f1m5u39ib0] chunk(s)# 7 (offset 29360128) : [DMD_NSD2 c72f1m5u39ib0,c72f1m5u37ib0] chunk(s)# 8 (offset 33554432) : [DMD_NSD3 c72f1m5u37ib0,c72f1m5u39ib0] chunk(s)# 9 (offset 37748736) : [DMD_NSD4 c72f1m5u39ib0,c72f1m5u37ib0] [FILE: /mnt/gpfs3a/data_out/lf SUMMARY INFO] replica1: c72f1m5u37ib0,c72f1m5u39ib0: 5 chunk(s) c72f1m5u39ib0,c72f1m5u37ib0: 5 chunk(s) Thanks and Regards, -Kums From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 07/09/2018 04:05 PM Subject:[gpfsug-discuss] What NSDs does a file have blocks on? Sent by:gpfsug-discuss-boun...@spectrumscale.org Hi All, I am still working on my issue of the occasional high I/O wait times and that has raised another question … I know that I can run mmfileid to see what files have a block on a given NSD, but is there a way to do the opposite? I.e. I want to know what NSDs a single file has its’ blocks on? The mmlsattr command does not appear to show this information unless it’s got an undocumented option. Thanks… Kevin — Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education kevin.buterba...@vanderbilt.edu - (615)875-9633 ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
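For reference, a small sketch that condenses the mmgetlocation output above into a per-NSD chunk count. The awk field positions are an assumption based on the sample output shown in this thread:

/usr/lpp/mmfs/samples/fpo/mmgetlocation -f /mnt/gpfs3a/data_out/lf | \
  awk '/chunk\(s\)#/ { nsd = $6; gsub(/\[/, "", nsd); count[nsd]++ }
       END { for (n in count) printf "%-12s %d chunk(s)\n", n, count[n] }'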
Re: [gpfsug-discuss] Lroc on NVME
Hi,

>> Yes, older versions of GPFS don't recognize /dev/nvme*. So you would need /var/mmfs/etc/nsddevices user exit.
>> On newer GPFS versions, the nvme devices are also generic
>> but has anyone else tried to get lroc running on nvme and how well does it work.

IMHO, the support to recognize /dev/nvme* was added in Spectrum Scale version 5.0.1. Spectrum Scale 5.0.1 also has LROC performance enhancements compared to earlier versions, for file stat/read performance from LROC devices.

The following is sample performance data using a single mdtest MPI process with LROC data on an Intel SSDPEDMD016T4L, for Spectrum Scale version 5.0.0 vs. 5.0.1.

Benchmark arguments:
mpiexec -f $MACH_FILE -n $MAX_NP $BENCHMARK -i 1 -n $n_files -u -F -T -E -e $file_sz -d $O_DIR

Sample:
mpiexec -f /mnt/sw_x86/mpich/mf.perf_x86.c72f1m5u27 -n 1 /mnt/sw_x86/benchmarks/mdtest/mdtest -i 1 -n 65536 -u -F -T -E -e '1024' -d /mnt/gpfs3a/lroc_mdtest_out/uniq_dir_1024_65536

File metadata ops/s, 5.0.0 vs. 5.0.1:

File Count  File Size (KiB)  5.0.0 Stat  5.0.0 Read  5.0.1 Stat  5.0.1 Read  Stat Delta (%)  Read Delta (%)
16384       1                4170        3979        5182        4765        24.29           19.77
16384       32               3938        3585        5220        4609        32.57           28.56
65536       1                1511        1319        2122        1893        40.46           43.50
65536       32               1418        661         2214        803         56.18           21.52

Best Regards,
-Kums

----- Original message -----
From: "Truong Vu"
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: gpfsug-discuss@spectrumscale.org
Subject: Re: [gpfsug-discuss] Lroc on NVME
Date: Tue, Jun 12, 2018 9:55 AM

Yes, older versions of GPFS don't recognize /dev/nvme*. So you would need /var/mmfs/etc/nsddevices user exit. On newer GPFS versions, the nvme devices are also generic. So, it is good that you are using the same NSD sub-type.

Cheers,
Tru.
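For older releases that do not recognize /dev/nvme* on their own, the /var/mmfs/etc/nsddevices user exit mentioned above can report the NVMe namespaces as generic devices. The sketch below is illustrative only; compare it with /usr/lpp/mmfs/samples/nsddevices.sample on your system, which also documents the exact return-code semantics (whether the script replaces or augments the built-in device discovery):

#!/bin/ksh
# Sketch of /var/mmfs/etc/nsddevices reporting NVMe namespaces as "generic"
# devices to GPFS device discovery (the device-name pattern is an assumption).
for dev in $(ls /dev 2>/dev/null | egrep '^nvme[0-9]+n[0-9]+$'); do
    echo "$dev generic"
done
# End the script the way the shipped sample does (see nsddevices.sample for
# whether return 0 means "use only these devices" or "also run built-in discovery").
return 0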
Re: [gpfsug-discuss] GPFS 4.2.3.4 question
Hi Kevin, >> Thanks - important followup question … does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again… I presume, by "mmrestripefs data loss bug" you are referring to APAR IV98609 (link below)? If yes, 4.2.3.4 contains the fix for APAR IV98609. http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 Problems fixed in GPFS 4.2.3.4 (details in link below): https://www.ibm.com/developerworks/community/forums/html/topic?id=f3705faa-b6aa-415c-a3e6-1fe9d8293db1=25 * This update addresses the following APARs: IV98545 IV98609 IV98640 IV98641 IV98643 IV98683 IV98684 IV98685 IV98686 IV98687 IV98701 IV99044 IV99059 IV99060 IV99062 IV99063. Regards, -Kums From: "Buterbaugh, Kevin L"To: gpfsug main discussion list Date: 08/27/2017 09:32 AM Subject:Re: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by:gpfsug-discuss-boun...@spectrumscale.org Fred / All, Thanks - important followup question … does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again… Kevin On Aug 26, 2017, at 7:35 PM, Frederick Stock wrote: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 sto...@us.ibm.com From:"Buterbaugh, Kevin L" To:gpfsug main discussion list Date:08/26/2017 03:40 PM Subject:[gpfsug-discuss] GPFS 4.2.3.4 question Sent by:gpfsug-discuss-boun...@spectrumscale.org Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I’d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great… Thanks! — Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education kevin.buterba...@vanderbilt.edu- (615)875-9633 ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss=DwICAg=jf_iaSHvJObTbx-siA1ZOg=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0= ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss=DwICAg=jf_iaSHvJObTbx-siA1ZOg=McIf98wfiVqHU8ZygezLrQ=0rUCqrbJ4Ny44Rmr8x8HvX5q4yqS-4tkN02fiIm9ttg=FYfr0P3sVBhnGGsj33W-A9JoDj7X300yTt5D4y5rpJY= ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
Re: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing
Hi, >>I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? >> I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. Please ensure that all the recommended FPO settings (e.g. allowWriteAffinity=yes in the FPO storage pool, readReplicaPolicy=local, restripeOnDiskFailure=yes) are set properly. Please find the FPO Best practices/tunings, in the links below: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Big%20Data%20Best%20practices https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/ab5c2792-feef-4a3a-a21b-d22c6f5d728a/attachment/80d5c300-7b39-4d6e-9596-84934fcc4638/media/Deploying_a_big_data_solution_using_IBM_Spectrum_Scale_v1.7.5.pdf >> For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). >> Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. With FPO, GPFS metadata (-m) and data replication (-r) needs to be enabled. The Write-affinity-Depth (WAD) setting defines the policy for directing writes. It indicates that the node writing the data directs the write to disks on its own node for the first copy and to the disks on other nodes for the second and third copies (if specified). readReplicaPolicy=local will enable the policy to read replicas from local disks. At the minimum, ensure that the networking used for GPFS is sized properly and has bandwidth 2X or 3X that of the local disk speeds to ensure FPO write bandwidth is not being constrained by GPFS replication over the network. For example, if 24 x Drives in RAID-0 results in ~4.8 GB/s (assuming ~200MB/s per drive) and GPFS metadata/data replication is set to 3 (-m 3 -r 3) then for optimal FPO write bandwidth, we need to ensure the network-interconnect between the FPO nodes is non-blocking/high-speed and can sustain ~14.4 GB/s ( data_replication_factor * local_storage_bandwidth). One possibility, is minimum of 2 x EDR Infiniband (configure GPFS verbsRdma/verbsPorts) or bonded 40GigE between the FPO nodes (for GPFS daemon-to-daemon communication). Application reads requiring FPO reads from remote GPFS node would as well benefit from high-speed network-interconnect between the FPO nodes. Regards, -Kums From: Evan KoutsandreouTo: "gpfsug-discuss@spectrumscale.org" Date: 08/20/2017 11:06 PM Subject:[gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing Sent by:gpfsug-discuss-boun...@spectrumscale.org Hi - I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. Thank you ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
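For reference, a sketch of how the FPO-related settings above typically appear in practice. The pool name, block size, write-affinity values and mount point are illustrative; the attribute names are the ones discussed above:

# Pool stanza (in the stanza file passed to mmcrfs/mmadddisk) enabling write affinity
%pool:
  pool=fpodata
  blockSize=2M
  layoutMap=cluster
  allowWriteAffinity=yes
  writeAffinityDepth=1
  blockGroupFactor=128

# Cluster-wide settings discussed above
mmchconfig readReplicaPolicy=local
mmchconfig restripeOnDiskFailure=yes -i

# Metadata and data replication factors set at file-system creation
mmcrfs fpofs -F fpo_disks.stanza -m 3 -M 3 -r 3 -R 3 -T /gpfs/fpofs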
Re: [gpfsug-discuss] Baseline testing GPFS with gpfsperf
Hi Scott,

>> - Should the number of threads equal the number of NSDs for the file system? or equal to the number of nodes?
>> - If I execute a large multi-threaded run of this tool from a single node in the cluster, will that give me an accurate result of the performance of the file system?

To add to Valdis's note, the answer to the above also depends on the node, the network used for GPFS communication between client and server, and the storage performance capabilities constituting the GPFS cluster/network/storage stack.

As an example, say the storage subsystem (including controller + disks) hosting the file system can deliver ~20 GB/s and the networking between NSD client and server is FDR 56Gb/s InfiniBand (~6 GB/s with verbsRdma). Assuming one FDR-IB link (verbsPorts) is configured per NSD server as well as per client, you would need a minimum of 4 x NSD servers (4 x 6 GB/s ==> 24 GB/s) to saturate the backend storage. So you would need to run gpfsperf (or any other parallel I/O benchmark) across a minimum of 4 x GPFS NSD clients to saturate the backend storage.

You can scale the gpfsperf thread count (-th parameter) depending on access pattern (buffered/dio etc.), but that only drives load from a single NSD client node. If you would like to drive I/O load from multiple NSD client nodes and synchronize the parallel runs across the nodes for accuracy, then gpfsperf-mpi is strongly recommended. You would need to use MPI to launch gpfsperf-mpi across multiple NSD client nodes and scale the MPI processes (1 or more MPI processes per NSD client) accordingly to drive the I/O load for good performance.

>> The cluster that I will be running this tool on will not have MPI installed and will have multiple file systems in the cluster.

Without MPI, an alternative would be to use ssh or pdsh to launch gpfsperf across multiple nodes; however, if there are slow NSD clients the results may not be accurate (slow clients take longer, and once the faster clients finish they get all the network/storage resources, skewing the performance analysis).

You may also consider using parallel Iozone, as it can be run across multiple nodes using rsh/ssh with a combination of the "-+m" and "-t" options.
http://iozone.org/docs/IOzone_msword_98.pdf

##
-+m filename
    Use this file to obtain the configuration information of the clients for
    cluster testing. The file contains one line for each client. Each line has
    three fields. The fields are space delimited. A # sign in column zero is a
    comment line. The first field is the name of the client. The second field
    is the path, on the client, for the working directory where Iozone will
    execute. The third field is the path, on the client, for the executable
    Iozone. To use this option one must be able to execute commands on the
    clients without being challenged for a password. Iozone will start remote
    execution by using "rsh". To use ssh, export RSH=/usr/bin/ssh
-t #
    Run Iozone in a throughput mode. This option allows the user to specify how
    many threads or processes to have active during the measurement.
##

Hope this helps,
-Kums

From: valdis.kletni...@vt.edu
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Date: 07/25/2017 07:59 PM
Subject: Re: [gpfsug-discuss] Baseline testing GPFS with gpfsperf
Sent by: gpfsug-discuss-boun...@spectrumscale.org

On Tue, 25 Jul 2017 15:46:45 -0500, "Scott C Batchelder" said:
> - Should the number of threads equal the number of NSDs for the file
> system? or equal to the number of nodes?
Depends on what definition of "throughput" you are interested in. If your configuration has 50 clients banging on 5 NSD servers, your numbers for 5 threads and 50 threads are going to tell you subtly different things... (Basically, one thread per NSD is going to tell you the maximum that one client can expect to get with little to no contention, while one per client will tell you about the maximum *aggregate* that all 50 can get together - which is probably still giving each individual client less throughput than one-to-one) We usually test with "exactly one thread total", "one thread per server", and "keep piling the clients on till the total number doesn't get any bigger". Also be aware that it only gives you insight to your workload performance if your workload is comprised of large file access - if your users are actually doing a lot of medium or small files, that changes the results dramatically as you end up possibly pounding on metadata more than the actual data [attachment "att0twxd.dat" deleted by Kumaran Rajaram/Arlington/IBM] ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
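For reference, a sketch of typical gpfsperf and gpfsperf-mpi invocations. The paths, sizes and thread counts are illustrative, and the option spelling should be checked against the README in /usr/lpp/mmfs/samples/perf on your release:

# Build the perf samples once
cd /usr/lpp/mmfs/samples/perf && make
# Single node, 8 threads, sequential create of a 64 GiB file in 16 MiB records
./gpfsperf create seq /gpfs/fs1/perf/testfile -n 64g -r 16m -th 8 -fsync
# Multi-node run of the MPI variant, one process per NSD client listed in hostfile
mpiexec -f hostfile -n 4 ./gpfsperf-mpi read seq /gpfs/fs1/perf/testfile -n 64g -r 16m -th 4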
Re: [gpfsug-discuss] get free space in GSS
Hi Atmane, >> I can not find the free space Based on your output below, your setup currently has two recovery groups BB1RGL and BB1RGR. Issue "mmlsrecoverygroup BB1RGL -L" and "mmlsrecoverygroup BB1RGR -L" to obtain free space in each DA. Based on your "mmlsrecoverygroup BB1RGL -L" output below, BB1RGL "DA1" has 12GiB and "DA2" has 4GiB free space. The metadataOnly and dataOnly vdisk/NSD are created from DA1 and DA2. declustered needsreplace scrub background activity array service vdisks pdisks spares threshold free space duration task progress priority --- --- -- -- -- - -- - LOG no1 3 0,0 1 558 GiB 14 days scrub 51% low DA1 no 11 582,31 2 12 GiB 14 days scrub 78% low DA2 no6 582,31 24096 MiB 14 days scrub 10% low In addition, you may use "mmlsnsd" to obtain mapping of file-system to vdisk/NSD + use "mmdf " command to query user or available capacity on a GPFS file system. Hope this helps, -Kums From: atmane khiredineTo: Laurence Horrocks-Barlow , "gpfsug main discussion list" Date: 07/09/2017 08:27 AM Subject:Re: [gpfsug-discuss] get free space in GSS Sent by:gpfsug-discuss-boun...@spectrumscale.org thank you very much for replying. I can not find the free space Here is the output of mmlsrecoverygroup [root@server1 ~]#mmlsrecoverygroup declustered arrays with recovery groupvdisks vdisks servers -- --- -- --- BB1RGL3 18 server1,server2 BB1RGR3 18 server2,server1 -- [root@server ~]# mmlsrecoverygroup BB1RGL -L declustered recovery group arrays vdisks pdisks format version - --- -- -- -- BB1RGL 3 18 119 4.2.0.1 declustered needsreplace scrub background activity array service vdisks pdisks spares threshold free space duration task progress priority --- --- -- -- -- - -- - LOG no1 3 0,0 1 558 GiB 14 days scrub 51% low DA1 no 11 582,31 2 12 GiB 14 days scrub 78% low DA2 no6 582,31 24096 MiB 14 days scrub 10% low declustered checksum vdisk RAID code array vdisk size block size granularity state remarks -- -- --- -- -- --- - --- gss0_logtip 3WayReplication LOG 128 MiB 1 MiB 512 oklogTip gss0_loghome4WayReplication DA1 40 GiB 1 MiB 512 oklog BB1RGL_GPFS4_META14WayReplication DA1 451 GiB 1 MiB 32 KiB ok BB1RGL_GPFS4_DATA18+2pDA15133 GiB 1 MiB 32 KiB ok BB1RGL_GPFS1_META14WayReplication DA1 451 GiB 1 MiB 32 KiB ok BB1RGL_GPFS1_DATA18+2pDA1 12 TiB 1 MiB 32 KiB ok BB1RGL_GPFS3_META1 4WayReplication DA1 451 GiB 1 MiB 32 KiB ok BB1RGL_GPFS3_DATA1 8+2pDA1 12 TiB 1 MiB 32 KiB ok BB1RGL_GPFS2_META1 4WayReplication DA1 451 GiB 1 MiB 32 KiB ok BB1RGL_GPFS2_DATA1 8+2pDA1 13 TiB 2 MiB 32 KiB ok BB1RGL_GPFS2_META2 4WayReplication DA2 451 GiB 1 MiB 32 KiB ok BB1RGL_GPFS2_DATA2 8+2pDA2 13 TiB 2 MiB 32 KiB ok BB1RGL_GPFS1_META24WayReplication DA2 451 GiB 1 MiB 32 KiB ok BB1RGL_GPFS1_DATA28+2pDA2 12 TiB 1 MiB 32 KiB ok BB1RGL_GPFS5_META1 4WayReplication DA1 750 GiB 1 MiB 32 KiB ok BB1RGL_GPFS5_DATA1 8+2pDA1 70 TiB 16 MiB 32 KiB ok BB1RGL_GPFS5_META2 4WayReplication DA2 750 GiB 1 MiB 32 KiB ok BB1RGL_GPFS5_DATA2 8+2pDA2 90 TiB 16 MiB 32 KiB ok config data declustered array VCD spares actual rebuild
Re: [gpfsug-discuss] IO prioritisation / throttling?
Hi John, >>We have a GPFS Setup using Fujitsu filers and Mellanox infiniband. >>The desire it to set up an environment for test and development where if IO ‘runs wild’ it will not bring down >>the production storage. You may use the Spectrum Scale Quality of Service for I/O "mmchqos" command (details in link below) to define IOPS limits for the "others" as well as the "maintenance" class for the Dev/Test file-system "pools" (for e.g., mmchqos tds_fs --enable pool=*,other=1IOPS, maintenance=5000IOPS). This way, the Test and Dev file-system/storage-pools IOPS can be limited/controlled to specified IOPS , giving higher priority to the production GPFS file-system/storage (with production_fs pool=* other=unlimited,maintenance=unlimited - which is the default). https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmchqos.htm https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_qosio_describe.htm#qosio_describe My two cents. Regards, -Kums From: John HearnsTo: gpfsug main discussion list Date: 06/23/2017 04:14 AM Subject:[gpfsug-discuss] IO prioritisation / throttling? Sent by:gpfsug-discuss-boun...@spectrumscale.org I guess this is a rather ill-defined question, and I realise it will be open to a lot of interpretations. We have a GPFS Setup using Fujitsu filers and Mellanox infiniband. The desire it to set up an environment for test and development where if IO ‘runs wild’ it will not bring down the production storage. If anyone has a setup like this I would be interested in chatting with you. Is it feasible to create filesets which have higher/lower priority than others? Thankyou for any insights or feedback John Hearns -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
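For reference, a sketch of the QoS commands cited above; the file-system names and IOPS values are illustrative:

# Throttle non-maintenance ("other") and maintenance I/O on the test/dev file system
mmchqos tds_fs --enable pool=*,other=10000IOPS,maintenance=5000IOPS
# Leave the production file system unthrottled (the default behaviour)
mmchqos prod_fs --enable pool=*,other=unlimited,maintenance=unlimited
# Monitor consumed IOPS per QoS class
mmlsqos tds_fs --seconds 60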
Re: [gpfsug-discuss] 4.2.3.x and sub-block size
Hi, >>Back at SC16 I was told that GPFS 4.2.3.x would remove the “a sub-block is 1/32nd of the block size” restriction. However, I have installed GPFS 4.2.3.1 on my test cluster and in the man page for mmcrfs I still see: >>So has the restriction been removed? If not, is there an update on which version of GPFS will remove it? If so, can the documentation be updated to reflect the change and how to take advantage of it? Thanks… Based on the current plan, this “a sub-block is 1/32nd of the block size” restriction will be removed in the upcoming GPFS version 4.2.4 (Please NOTE: Support for >32 subblocks per block may subject to be delayed based on internal qualification/validation efforts). Regards, -Kums From: "Buterbaugh, Kevin L"To: gpfsug main discussion list Date: 06/14/2017 12:12 PM Subject:[gpfsug-discuss] 4.2.3.x and sub-block size Sent by:gpfsug-discuss-boun...@spectrumscale.org Hi All, Back at SC16 I was told that GPFS 4.2.3.x would remove the “a sub-block is 1/32nd of the block size” restriction. However, I have installed GPFS 4.2.3.1 on my test cluster and in the man page for mmcrfs I still see: 2. The GPFS block size determines: * The minimum disk space allocation unit. The minimum amount of space that file data can occupy is a sub‐block. A sub‐block is 1/32 of the block size. So has the restriction been removed? If not, is there an update on which version of GPFS will remove it? If so, can the documentation be updated to reflect the change and how to take advantage of it? Thanks… Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education kevin.buterba...@vanderbilt.edu - (615)875-9633 ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
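For reference, once a file system is created at a 5.x file-system format, the subblock size is no longer fixed at 1/32 of the block size and can be confirmed as follows (the 4 MiB block / 8 KiB subblock / 512 subblocks values are the ones from the examples earlier in this digest):

mmlsfs fs1 -B                                    # Block size, e.g. 4194304
mmlsfs fs1 -f                                    # Minimum fragment (subblock) size, e.g. 8192
mmlsfs fs1 | grep subblocks-per-full-block       # e.g. 512 for a 4 MiB block size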
Re: [gpfsug-discuss] Well, this is the pits...
>>Thanks for the info on the releases … can you clarify about pitWorkerThreadsPerNode? pitWorkerThreadsPerNode -- specifies how many threads per node do restripe, data movement, etc. >>As I said in my original post, on all 8 NSD servers and the filesystem manager it is set to zero. No matter how many times I add zero to zero I don’t get a value > 31! ;-) So I take it that zero has some sort of unspecified significance? Thanks… A value of 0 means that pitWorkerThreadsPerNode takes an internally computed value (16 or lower), based on the GPFS setup and file-system configuration, per the following formula. Default is pitWorkerThreadsPerNode = MIN(16, (numberOfDisks_in_filesystem * 4) / numberOfParticipatingNodes_in_mmrestripefs + 1). For example, if you have 64 NSDs in your file system and you are using 8 NSD servers in "mmrestripefs -N", then pitWorkerThreadsPerNode = MIN(16, (64 * 4)/8 + 1) = MIN(16, 33) = 16, so each participating node runs 16 restripe threads (i.e., the default of 0 results in 16 threads per participating node in this case). If you want all 8 NSD servers (running 4.2.2.3) to participate in the mmrestripefs operation, then set "mmchconfig pitWorkerThreadsPerNode=3 -N <8_NSD_Servers>" so that the sum (8 x 3 = 24) stays below 31. Regards, -Kums
From: "Buterbaugh, Kevin L" <kevin.buterba...@vanderbilt.edu> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org> Date: 05/04/2017 12:57 PM Subject: Re: [gpfsug-discuss] Well, this is the pits... Sent by: gpfsug-discuss-boun...@spectrumscale.org
Hi Kums, Thanks for the info on the releases … can you clarify about pitWorkerThreadsPerNode? As I said in my original post, on all 8 NSD servers and the filesystem manager it is set to zero. No matter how many times I add zero to zero I don’t get a value > 31! ;-) So I take it that zero has some sort of unspecified significance? Thanks… Kevin On May 4, 2017, at 11:49 AM, Kumaran Rajaram <k...@us.ibm.com> wrote: Hi, >>I’m running 4.2.2.3 on my GPFS servers (some clients are on 4.2.1.1 or 4.2.0.3 and are gradually being upgraded). What version of GPFS fixes this? With what I’m doing I need the ability to run mmrestripefs. GPFS version 4.2.3.0 (and above) fixes this issue and allows the "sum of pitWorkerThreadsPerNode of the participating nodes (-N parameter to mmrestripefs)" to exceed 31. If you are using 4.2.2.3, then depending on the "number of nodes participating in the mmrestripefs", the GPFS config parameter "pitWorkerThreadsPerNode" needs to be adjusted such that "sum of pitWorkerThreadsPerNode of the participating nodes <= 31". For example, if the "number of nodes participating in the mmrestripefs" is 6 then adjust "mmchconfig pitWorkerThreadsPerNode=5 -N ". GPFS would need to be restarted for this parameter to take effect on the participating_nodes (verify with mmfsadm dump config | grep pitWorkerThreadsPerNode). Regards, -Kums
From: "Buterbaugh, Kevin L" <kevin.buterba...@vanderbilt.edu> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org> Date: 05/04/2017 12:08 PM Subject: Re: [gpfsug-discuss] Well, this is the pits... Sent by: gpfsug-discuss-boun...@spectrumscale.org
Hi Olaf, I didn’t touch pitWorkerThreadsPerNode … it was already zero. I’m running 4.2.2.3 on my GPFS servers (some clients are on 4.2.1.1 or 4.2.0.3 and are gradually being upgraded). What version of GPFS fixes this? With what I’m doing I need the ability to run mmrestripefs. 
It seems to me that mmrestripefs could check whether QOS is enabled … granted, it would have no way of knowing whether the values used actually are reasonable or not … but if QOS is enabled then “trust” it to not overrun the system. PMR time? Thanks.. Kevin On May 4, 2017, at 10:54 AM, Olaf Weiser <olaf.wei...@de.ibm.com> wrote: Hi Kevin, the number of NSDs is more or less nonsense .. it is just that the number of nodes x PITWorker should not exceed the #mutex/FS block by too much. Did you adjust/tune the PitWorker? ... as far as I know, the fact that the code checks the number of NSDs is already considered a defect and will be fixed / is already fixed (I stepped into it here as well). ps. QOS is the better approach to address this, but unfortunately.. not everyone is using it by default... that's why I suspect the development decided to put in a check/limit here .. which in your case (with QOS) wouldn't be needed.
From: "Buterbaugh, Kevin L" <kevin.buterba...@vanderbilt.edu> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org> Date: 05/04/2017 05:44 PM Subject: Re: [gpfsug-discuss] Well, this is the pits... Sent by: gpfsug-discuss-boun...@spectrumscale.org
Hi Olaf, Your explanation most
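To make the arithmetic above concrete, a hedged sketch for a hypothetical 4.2.2.3 cluster with 64 NSDs where all 8 NSD servers should participate in the restripe (the node names below are placeholders):
# mmchconfig pitWorkerThreadsPerNode=3 -N nsd01,nsd02,nsd03,nsd04,nsd05,nsd06,nsd07,nsd08
(restart GPFS on those nodes, then verify the value took effect:)
# mmfsadm dump config | grep pitWorkerThreadsPerNode
With the default of 0, each server would compute MIN(16, (64 x 4)/8 + 1) = MIN(16, 33) = 16 threads, and 8 x 16 = 128 is far above the 31-thread limit enforced on 4.2.2.3; capping each server at 3 threads keeps the sum at 8 x 3 = 24, which stays within the limit.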
Re: [gpfsug-discuss] Well, this is the pits...
Hi, >>I’m running 4.2.2.3 on my GPFS servers (some clients are on 4.2.1.1 or 4.2.0.3 and are gradually being upgraded). What version of GPFS fixes this? With what I’m doing I need the ability to run mmrestripefs. GPFS version 4.2.3.0 (and above) fixes this issue and allows the "sum of pitWorkerThreadsPerNode of the participating nodes (-N parameter to mmrestripefs)" to exceed 31. If you are using 4.2.2.3, then depending on the "number of nodes participating in the mmrestripefs", the GPFS config parameter "pitWorkerThreadsPerNode" needs to be adjusted such that "sum of pitWorkerThreadsPerNode of the participating nodes <= 31". For example, if the "number of nodes participating in the mmrestripefs" is 6 then adjust "mmchconfig pitWorkerThreadsPerNode=5 -N ". GPFS would need to be restarted for this parameter to take effect on the participating_nodes (verify with mmfsadm dump config | grep pitWorkerThreadsPerNode). Regards, -Kums
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 05/04/2017 12:08 PM Subject: Re: [gpfsug-discuss] Well, this is the pits... Sent by: gpfsug-discuss-boun...@spectrumscale.org
Hi Olaf, I didn’t touch pitWorkerThreadsPerNode … it was already zero. I’m running 4.2.2.3 on my GPFS servers (some clients are on 4.2.1.1 or 4.2.0.3 and are gradually being upgraded). What version of GPFS fixes this? With what I’m doing I need the ability to run mmrestripefs. It seems to me that mmrestripefs could check whether QOS is enabled … granted, it would have no way of knowing whether the values used actually are reasonable or not … but if QOS is enabled then “trust” it to not overrun the system. PMR time? Thanks.. Kevin On May 4, 2017, at 10:54 AM, Olaf Weiser wrote: Hi Kevin, the number of NSDs is more or less nonsense .. it is just that the number of nodes x PITWorker should not exceed the #mutex/FS block by too much. Did you adjust/tune the PitWorker? ... as far as I know, the fact that the code checks the number of NSDs is already considered a defect and will be fixed / is already fixed (I stepped into it here as well). ps. QOS is the better approach to address this, but unfortunately.. not everyone is using it by default... that's why I suspect the development decided to put in a check/limit here .. which in your case (with QOS) wouldn't be needed.
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 05/04/2017 05:44 PM Subject: Re: [gpfsug-discuss] Well, this is the pits... Sent by: gpfsug-discuss-boun...@spectrumscale.org
Hi Olaf, Your explanation mostly makes sense, but... Failed with 4 nodes … failed with 2 nodes … not gonna try with 1 node. And this filesystem only has 32 disks, which I would imagine is not an especially large number compared to what some people reading this e-mail have in their filesystems. I thought that QOS (which I’m using) was what would keep an mmrestripefs from overrunning the system … QOS has worked extremely well for us - it’s one of my favorite additions to GPFS. Kevin On May 4, 2017, at 10:34 AM, Olaf Weiser wrote: no.. it is just in the code, because we have to avoid running out of mutexes / blocks. Reducing the number of nodes -N down to 4 (2 nodes is even safer) ... is the easiest way to solve it for now. I've been told the real root cause will be fixed in one of the next PTFs .. within this year .. this warning message itself should appear every time.. but unfortunately someone coded it so that it depends on the number of disks (NSDs).. 
that's why I suspect you didn't see it before. But the fact that we have to make sure not to overrun the system with mmrestripe remains.. so please lower the -N number of nodes to 4, or better 2 (even though we know.. the mmrestripe will take longer).
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 05/04/2017 05:26 PM Subject: [gpfsug-discuss] Well, this is the pits... Sent by: gpfsug-discuss-boun...@spectrumscale.org
Hi All, Another one of those, “I can open a PMR if I need to” type questions… We are in the process of combining two large GPFS filesystems into one new filesystem (for various reasons I won’t get into here). Therefore, I’m doing a lot of mmrestripe’s, mmdeldisk’s, and mmadddisk’s. Yesterday I did an “mmrestripefs -r -N ” (after suspending a disk, of course). Worked like it should. Today I did a “mmrestripefs -b -P capacity -N ” and got: mmrestripefs: The total number of PIT worker threads of all participating nodes has been exceeded to safely restripe the file system. The total number of PIT worker threads, which is the sum of
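Since QOS is repeatedly mentioned in this thread as the better guard rail, a hedged sketch of that route (the file-system name, IOPS cap, and node list are placeholders; as I understand it, long-running commands such as mmrestripefs are charged to the QoS maintenance class once QoS is enabled):
# mmchqos gpfs_fs --enable pool=*,maintenance=10000IOPS,other=unlimited
# mmrestripefs gpfs_fs -b -P capacity -N nsd01,nsd02,nsd03,nsd04
# mmlsqos gpfs_fs
This throttles the restripe I/O itself rather than the number of participating nodes, so normal application I/O keeps priority while the rebalance proceeds in the background.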
Re: [gpfsug-discuss] RAID config for SSD's - potential pitfalls
Hi, >> As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple of points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: >>Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. >>This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write RAID pitfall. As you pointed out, the RAID choices for GPFS may not be simple, and we need to take into consideration factors such as the storage subsystem configuration/capabilities, for example whether all drives are homogeneous or there is a mix of drive types. If all the drives are homogeneous, then create dataAndMetadata NSDs across RAID-6, and if the storage controller supports write-cache plus write-cache mirroring (WC + WM), enable it; WC + WM can alleviate the read-modify-write penalty for small writes (typical of metadata). If there is a MIX of SSD and HDD (e.g., 15K RPM), then we need to take into consideration the aggregate IOPS of the RAID-1 SSD volumes vs. the RAID-6 HDD volumes before separating data and metadata onto separate media. For example, if the storage subsystem has 2 x SSDs and ~300 x 15K RPM or NL-SAS HDDs, then most likely the aggregate IOPS of the RAID-6 HDD volumes will be higher than that of the RAID-1 SSD volumes. It is also recommended to assess the I/O performance of the different configurations (dataAndMetadata vs. dataOnly/metadataOnly NSDs) with representative application workloads and production scenarios before deploying the final solution. >> GPFS has built-in replication features - consider using those instead of RAID replication (classically RAID-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more >>robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. For high resiliency (e.g., for metadataOnly NSDs), and if there are multiple storage units across different failure domains (different racks/rooms/data centers, etc.), it is good to enable BOTH hardware RAID-1 and GPFS metadata replication (at the minimum, -m 2). If there is a single shared storage subsystem for the GPFS file system and metadata is separated from data, then RAID-1 would minimize administrative overhead compared to GPFS replication in the event of a drive failure (GPFS replication across single-SSD NSDs would require mmdeldisk/mmdelnsd/mmcrnsd/mmadddisk every time a disk goes faulty and needs to be replaced). Best, -Kums
From: Marc A Kaplan/Watson/IBM@IBMUS To: gpfsug main discussion list Date: 04/19/2017 04:50 PM Subject: Re: [gpfsug-discuss] RAID config for SSD's - potential pitfalls Sent by: gpfsug-discuss-boun...@spectrumscale.org
As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple of points to consider, I'm sure there's more. 
And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
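As a hedged illustration of the "hardware RAID-1 plus GPFS metadata replication across failure domains" recommendation above, an NSD stanza file might look roughly like this (device paths, NSD and server names, and failure-group numbers are placeholders for a hypothetical two-enclosure setup):
%nsd: device=/dev/mapper/ssd_mirror_a nsd=meta_fg1 servers=nsdsrv01,nsdsrv02 usage=metadataOnly failureGroup=1
%nsd: device=/dev/mapper/ssd_mirror_b nsd=meta_fg2 servers=nsdsrv02,nsdsrv01 usage=metadataOnly failureGroup=2
%nsd: device=/dev/mapper/raid6_a nsd=data_fg1 servers=nsdsrv01,nsdsrv02 usage=dataOnly failureGroup=1
%nsd: device=/dev/mapper/raid6_b nsd=data_fg2 servers=nsdsrv02,nsdsrv01 usage=dataOnly failureGroup=2
# mmcrnsd -F nsd_stanzas.txt
# mmcrfs gpfs_fs -F nsd_stanzas.txt -m 2 -M 2 -r 1 -R 2
With the metadataOnly NSDs placed in two different failure groups and -m 2, GPFS keeps two copies of the metadata in separate failure domains, while data stays single-copy on the RAID-6 volumes.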
Re: [gpfsug-discuss] question on viewing block distribution across NSDs
Hi, Yes, you could use "mmdf" to obtain file-system "usage" across the NSDs (comprising the file system). If you want to obtain the "data block distribution corresponding to a file across the NSDs", then there is a utility "mmgetlocation" in /usr/lpp/mmfs/samples/fpo that can be used to get the file-data-block to NSD mapping. Example:
# The file system comprises a single storage pool, all NSDs configured as dataAndMetadata, -m 1 -r 1, FS block size = 2MiB
# mmlsfs gpfs1b | grep 'Block size'
 -B 2097152 Block size
# The file system is comprised of 10 x dataAndMetadata NSDs
# mmlsdisk gpfs1b | grep DMD | wc -l
10
# Create a sample file that is 40MiB (20 data blocks)
/mnt/sw/benchmarks/gpfsperf/gpfsperf create seq -r 2m -n 40m /mnt/gpfs1b/temp_dir/lf.s.1
# File size is 40 MiB
(09:52:49) c25m3n07:~ # ls -lh /mnt/gpfs1b/temp_dir/lf.s.1
-rw-r--r-- 1 root root 40M Mar 17 09:52 /mnt/gpfs1b/temp_dir/lf.s.1
(09:52:54) c25m3n07:~ # du -sh /mnt/gpfs1b/temp_dir/lf.s.1
40M /mnt/gpfs1b/temp_dir/lf.s.1
# Verified through mmgetlocation that the file data blocks are uniformly striped across all the dataAndMetadata NSDs, with each NSD containing 2 file data blocks
# In the output below, "DMD_NSDxx" is the name of the NSD.
(09:53:00) c25m3n07:~ # /usr/lpp/mmfs/samples/fpo/mmgetlocation -f /mnt/gpfs1b/temp_dir/lf.s.1
[FILE INFO]
blockSize 2 MB
blockGroupFactor 1
metadataBlockSize 2M
writeAffinityDepth 0
flags:
data replication: 1 max 2
storage pool name: system
metadata replication: 1 max 2
Chunk 0 (offset 0) is located at disks: [ DMD_NSD09 c25m3n07-ib,c25m3n08-ib ]
Chunk 1 (offset 2097152) is located at disks: [ DMD_NSD10 c25m3n08-ib,c25m3n07-ib ]
Chunk 2 (offset 4194304) is located at disks: [ DMD_NSD01 c25m3n07-ib,c25m3n08-ib ]
Chunk 3 (offset 6291456) is located at disks: [ DMD_NSD02 c25m3n08-ib,c25m3n07-ib ]
Chunk 4 (offset 8388608) is located at disks: [ DMD_NSD03 c25m3n07-ib,c25m3n08-ib ]
Chunk 5 (offset 10485760) is located at disks: [ DMD_NSD04 c25m3n08-ib,c25m3n07-ib ]
Chunk 6 (offset 12582912) is located at disks: [ DMD_NSD05 c25m3n07-ib,c25m3n08-ib ]
Chunk 7 (offset 14680064) is located at disks: [ DMD_NSD06 c25m3n08-ib,c25m3n07-ib ]
Chunk 8 (offset 16777216) is located at disks: [ DMD_NSD07 c25m3n07-ib,c25m3n08-ib ]
Chunk 9 (offset 18874368) is located at disks: [ DMD_NSD08 c25m3n08-ib,c25m3n07-ib ]
Chunk 10 (offset 20971520) is located at disks: [ DMD_NSD09 c25m3n07-ib,c25m3n08-ib ]
Chunk 11 (offset 23068672) is located at disks: [ DMD_NSD10 c25m3n08-ib,c25m3n07-ib ]
Chunk 12 (offset 25165824) is located at disks: [ DMD_NSD01 c25m3n07-ib,c25m3n08-ib ]
Chunk 13 (offset 27262976) is located at disks: [ DMD_NSD02 c25m3n08-ib,c25m3n07-ib ]
Chunk 14 (offset 29360128) is located at disks: [ DMD_NSD03 c25m3n07-ib,c25m3n08-ib ]
Chunk 15 (offset 31457280) is located at disks: [ DMD_NSD04 c25m3n08-ib,c25m3n07-ib ]
Chunk 16 (offset 33554432) is located at disks: [ DMD_NSD05 c25m3n07-ib,c25m3n08-ib ]
Chunk 17 (offset 35651584) is located at disks: [ DMD_NSD06 c25m3n08-ib,c25m3n07-ib ]
Chunk 18 (offset 37748736) is located at disks: [ DMD_NSD07 c25m3n07-ib,c25m3n08-ib ]
Chunk 19 (offset 39845888) is located at disks: [ DMD_NSD08 c25m3n08-ib,c25m3n07-ib ]
[SUMMARY INFO]
Replica num : Nodename : Total chunks
Replica 1 : c25m3n07-ib,c25m3n08-ib : Total : 10
Replica 1 : c25m3n08-ib,c25m3n07-ib : Total : 10
Best Regards, -Kums
From: To: Date: 03/29/2017 08:00 PM Subject: Re: [gpfsug-discuss] question on viewing block distribution across NSDs Sent by: gpfsug-discuss-boun...@spectrumscale.org
I was going to keep mmdf in mind, not gpfs.snap. 
I will now also keep in mind that mmdf can have an impact, as at present we have spinning disk for metadata. The system I am playing around on is not production yet, so I am safe for the moment. Thanks again.
From: gpfsug-discuss-boun...@spectrumscale.org [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] Sent: Thursday, 30 March 2017 9:55 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] question on viewing block distribution across NSDs
I don't necessarily think you need to run a snap prior, just the output of mmdf should be enough. Something to keep in mind that I should have said before-- an mmdf can be stressful on your system, particularly if you have spinning disk for your metadata. We're fortunate enough to have all flash for our metadata and I tend to take it for granted sometimes :)
From: greg.lehm...@csiro.au Sent: 3/29/17, 19:52 To: gpfsug main discussion list Subject:
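For completeness, a hedged example of the mmdf route mentioned at the start of this thread (the file-system name gpfs1b is reused from the mmgetlocation example above; per the caveat about spinning-disk metadata, it is safest to run it during a quiet period):
# mmdf gpfs1b --block-size auto
This reports free and in-use capacity per NSD (and per storage pool), i.e., the block distribution across NSDs at the file-system level, whereas mmgetlocation shows the distribution for an individual file.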