Re: [gpfsug-discuss] Running the Spectrum Scale on a Compute-only Cluster ?

2022-03-11 Thread Kumaran Rajaram
Hi,

>> The Spectrum Scale GUI is widely run on GPFS clusters that include their own 
>> storage, but what about multi-cluster with separate Storage and Compute 
>> clusters?

Yes, the Spectrum Scale GUI works in a GPFS multi-cluster setup with separate
Storage and Compute clusters.

The GUI works fine on both a compute-only cluster and a storage-only cluster.

>> And if so does it simply omit components that are not present - such as 
>> Recovery Groups and NSD Servers ?

Correct. The compute cluster GUI panels will still show the remotely mounted
file systems and filesets, along with their health status.

Cheers,
-Kums

Kumaran Rajaram

From: gpfsug-discuss-boun...@spectrumscale.org 
 On Behalf Of Kidger, Daniel
Sent: Friday, March 11, 2022 12:34 PM
To: gpfsug-discuss@spectrumscale.org
Subject: [gpfsug-discuss] Running the Spectrum Scale on a Compute-only Cluster ?

The Spectrum Scale GUI is widely run on GPFS clusters that include their own 
storage, but what about multi-cluster with separate Storage and Compute 
clusters?

Will the GUI run on a compute only cluster?
And if so does it simply omit components that are not present - such as 
Recovery Groups and NSD Servers ?



Daniel

Daniel Kidger
HPC Storage Solutions Architect, EMEA
daniel.kid...@hpe.com<mailto:daniel.kid...@hpe.com>
+44 (0)7818 522266

hpe.com<http://www.hpe.com/>


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] IO sizes

2022-02-24 Thread Kumaran Rajaram
Hi Uwe,

>> But what puzzles me even more: one of the servers compiles IOs even smaller,
>> varying between 3.2MiB and 3.6MiB mostly - both for reads and writes ... I
>> just cannot see why.

IMHO, if GPFS on this particular NSD server was restarted often during the
setup, then it is possible that the GPFS pagepool is not contiguous. As a
result, a GPFS 8MiB buffer in the pagepool might be a scatter-gather (SG) list
with many small entries (in memory), resulting in smaller I/Os when these
buffers are issued to the disks. The fix would be to reboot the server and
start GPFS so that the pagepool is contiguous, resulting in the 8MiB buffer
comprising one (or few) SG entries.

>> In the current situation (i.e. with IOs a bit larger than 4MiB) setting
>> max_sectors_kB to 4096 might do the trick, but as I do not know the cause for
>> that behaviour it might well start to issue IOs smaller than 4MiB again at
>> some point, so that is not a nice solution.

It is advisable not to restart GPFS often on the NSD servers (in production),
to keep the pagepool contiguous. Ensure that there is enough free memory in the
NSD servers and do not run memory-intensive jobs there, so that the pagepool is
not impacted (e.g. swapped out).

Also, enable GPFS numaMemoryInterleave=yes and verify that the pagepool is
equally distributed across the NUMA domains for good performance.
numaMemoryInterleave=yes requires that the numactl package is installed and
that GPFS is restarted afterwards.
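
For reference, a minimal sketch of enabling this (the node names are
placeholders; it assumes a maintenance window for restarting GPFS on those
nodes):

# mmchconfig numaMemoryInterleave=yes -N nsdserver01,nsdserver02
# mmshutdown -N nsdserver01,nsdserver02
# mmstartup -N nsdserver01,nsdserver02

The checks below (mmfsadm dump config and numastat) can then be used to verify
the setting and the pagepool distribution across the NUMA domains.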

# mmfsadm dump config | egrep "numaMemory|pagepool "
! numaMemoryInterleave yes
! pagepool 282394099712

# pgrep mmfsd | xargs numastat -p

Per-node process memory usage (in MBs) for PID 2120821 (mmfsd)
            Node 0      Node 1       Total
         ---------   ---------   ---------
Huge          0.00        0.00        0.00
Heap          1.26        3.26        4.52
Stack         0.01        0.01        0.02
Private  137710.43   137709.96   275420.39
         ---------   ---------   ---------
Total    137711.70   137713.23   275424.92

My two cents,
-Kums

Kumaran Rajaram

From: gpfsug-discuss-boun...@spectrumscale.org 
 On Behalf Of Uwe Falke
Sent: Wednesday, February 23, 2022 8:04 PM
To: gpfsug-discuss@spectrumscale.org
Subject: Re: [gpfsug-discuss] IO sizes


Hi,

the test bench is gpfsperf running on up to 12 clients with 1...64 threads
doing sequential reads and writes; the file size per gpfsperf process is 12TB
(with 6TB I saw caching effects, in particular for large thread numbers ...)

As I wrote initially: GPFS is issuing nothing but 8MiB IOs to the data disks, 
as expected in that case.

Interesting thing though:

I have rebooted the suspicious node. Now, it does not issue smaller IOs than
the others, but -- unbelievable -- larger ones (up to about 4.7MiB). This is
still harmful, as that size is also incompatible with full-stripe writes on the
storage (8+2 disk groups, i.e. logically RAID6).

Currently, I draw this information from the storage boxes; I have not yet 
checked iostat data for that benchmark test after the reboot (before, when IO 
sizes were smaller, we saw that both in iostat and in the perf data retrieved 
from the storage controllers).



And: we have a separate data pool, hence dataOnly NSDs; I am just talking
about these ...



As for "Are you sure that Linux OS is configured the same on all 4 NSD 
servers?." - of course there are not two boxes identical in the world. I have 
actually not installed those machines, and, yes, i also considered reinstalling 
them (or at least the disturbing one).

However, I do not have reason to assume or expect a difference, the supplier 
has just implemented these systems  recently from scratch.



In the current situation (i.e. with IOs bit larger than 4MiB) setting 
max_sectors_kB to 4096 might do the trick, but as I do not know the cause for 
that behaviour it might well start to issue IOs smaller than 4MiB again at some 
point, so that is not a nice solution.



Thanks

Uwe


On 23.02.22 22:20, Andrew Beattie wrote:
Alex,

Metadata will be 4Kib

Depending on the filesystem version you will also have subblocks to consider:
V4 filesystems have 1/32 subblocks, V5 filesystems have 1/1024 subblocks
(assuming metadata and data block size is the same).

My first question would be: "Are you sure that Linux OS is configured the same
on all 4 NSD servers?"

My second question would be: do you know what your average file size is? If
most of your files are smaller than your filesystem block size, then you are
always going to be performing writes using groups of subblocks rather than
full block writes.

Regards,

Andrew



On 24 Feb 2022, at 04:39, Alex Chekholko 
<mailto:a...@calicolabs.com> wrote:

Re: [gpfsug-discuss] du --apparent-size and quota

2021-06-01 Thread Kumaran Rajaram
Hi,

>> If I'm not mistaken even with SS5 created filesystems, 1 MiB FS block size 
>> implies 32 kiB sub blocks (32 sub-blocks).

Just to add: The /srcfilesys seemed to have been created with GPFS version 4.x 
which supports only 32 sub-blocks per block.

-T /srcfilesys                          Default mount point
-V 16.00 (4.2.2.0)                      Current file system version
   14.10 (4.1.0.4)                      Original file system version
--create-time Tue Feb  3 11:46:10 2015  File system creation time
-B 1048576                              Block size
-f 32768                                Minimum fragment (subblock) size in bytes
--subblocks-per-full-block 32           Number of subblocks per full block


The /dstfilesys was created with GPFS version 5.x, which supports more than 32
subblocks per block. /dstfilesys has 512 subblocks per full block with an 8KiB
subblock size, since the file-system block size is 4MiB (4MiB / 512 = 8KiB).


-T /dstfilesys                          Default mount point
-V 23.00 (5.0.5.0)                      File system version
--create-time Tue May 11 16:51:27 2021  File system creation time
-B 4194304                              Block size
-f 8192                                 Minimum fragment (subblock) size in bytes
--subblocks-per-full-block 512          Number of subblocks per full block

Hope this helps,
-Kums



-Original Message-
From: gpfsug-discuss-boun...@spectrumscale.org 
 On Behalf Of Loic Tortay
Sent: Tuesday, June 1, 2021 10:57 AM
To: gpfsug main discussion list ; Ulrich 
Sibiller ; gpfsug-disc...@gpfsug.org
Subject: Re: [gpfsug-discuss] du --apparent-size and quota

On 6/1/21 4:26 PM, Ulrich Sibiller wrote:
[...]
> )
> 
> While trying to understand what's going on here I found this on the 
> source file system (which is valid for all files, with different 
> number of course):
> 
> $ du --block-size 1 /srcfilesys/fileset/filename
> 65536   /srcfilesys/fileset/filename
> 
> $ du --apparent-size --block-size 1 /srcfilesys/fileset/filename
> 3994    /srcfilesys/fileset/filename
> 
> $ stat /srcfilesys/fileset/filename
>   File: ‘/srcfilesys/fileset/filename’
>   Size: 3994          Blocks: 128        IO Block: 1048576   regular file
> Device: 2ah/42d       Inode: 23266095    Links: 1
> Access: (0660/-rw-rw----)  Uid: (73018/ cpunnoo)   Gid: (50070/ dc-rti)
> Context: system_u:object_r:unlabeled_t:s0
> Access: 2021-05-12 20:10:13.814459305 +0200
> Modify: 2020-07-16 11:08:41.631006000 +0200
> Change: 2020-07-16 11:08:41.630896177 +0200
>  Birth: -
> 
Hello,
This looks like the sub-block overhead.
If I'm not mistaken even with SS5 created filesystems, 1 MiB FS block size 
implies 32 kiB sub blocks (32 sub-blocks).
The sub-block is the minimum disk allocation for files (if the file content is 
too large to be kept in the inode, when that is supported on the specific GPFS 
filesystem).

The "Blocks" value displayed by "stat" is in 512 bytes unit, so 128*512 = 65536 
(which is consistent with "du"): two 32 kiB sub-blocks due to data replication.

The "--apparent-size" option to "du" uses the user visible size not the actual 
disk usage (per the man page), so 3994 is also consistent w/ "stat" output.

AFAIK, GPFS space quotas count the sub-blocks not the apparent sizes, so again 
this would be consistent with the overhead.

Besides the overhead, hard links in the source FS (which, if I'm not mistaken,
are not handled by "rsync" unless you specify "-H") and, in some cases, sparse
files can also explain the differences.
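
As a hedged illustration only (the paths are hypothetical), an rsync invocation
that preserves hard links and re-creates sparse files as sparse would look
like:

$ rsync -aH --sparse /srcfilesys/fileset/ /dstfilesys/fileset/

(-a archive mode, -H preserve hard links, --sparse handle sparse files
efficiently instead of fully allocating them on the destination.)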


Loïc.
-- 
|   Loïc Tortay  - IN2P3 Computing Centre  |
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Spectrum Scale - how to get RPO=0

2021-05-24 Thread Kumaran Rajaram
Hi Tom,

>> we are trying to implement a mixed linux/windows environment and we have one
>> thing at the top - is there any global method to avoid asynchronous I/O and
>> write everything in synchronous mode?

If the local and remote sites have good inter-site network bandwidth and 
low-latency, then you may consider using GPFS synchronous replication at the 
file-system level (-m 2 -r 2).  The Spectrum Scale documentation (link below) 
has further details.

https://www.ibm.com/docs/en/spectrum-scale/5.1.0?topic=data-synchronous-mirroring-gpfs-replication
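
As a rough sketch only (the file system name, the stanza file, and whether you
create a new file system or change an existing one are assumptions; the NSDs
must be assigned to failure groups so that each site forms its own failure
group, otherwise both copies can land on the same site):

# mmcrfs fs1 -F nsd_stanza.txt -m 2 -M 2 -r 2 -R 2    (new file system: 2 copies of metadata and data)
# mmchfs fs1 -m 2 -r 2 ; mmrestripefs fs1 -R          (existing file system: raise defaults, then re-replicate existing files)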

Regards,
-Kums

From: gpfsug-discuss-boun...@spectrumscale.org 
 On Behalf Of Tomasz Rachobinski
Sent: Monday, May 24, 2021 9:06 AM
To: gpfsug-discuss@spectrumscale.org
Subject: [gpfsug-discuss] Spectrum Scale - how to get RPO=0

Hello everyone,
we are trying to implement a mixed linux/windows environment and we have one 
thing at the top - is there any global method to avoid asynchronous I/O and 
write everything in synchronous mode?
Another thing is - if there is no global sync setting, how to enforce sync i/o 
from linux/windows client?

Greetings
Tom Rachobinski
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Client Latency and High NSD Server Load Average

2020-06-04 Thread Kumaran Rajaram

Hi,

 >> I do notice nsd03/nsd04 have long waiters, but nsd01 doesn't (nsd02-ib is
 >> offline for now):

Please issue "mmlsdisk <fs-device> -m" on an NSD client to ascertain the active
NSD server serving each NSD. Since nsd02-ib is offline, it is possible that
some servers are serving more NSDs than the rest.

https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_PoorPerformanceDuetoDiskFailure.htm
https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_HealthStateOfNSDserver.htm
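
For example (the file system name is a placeholder), from an NSD client and the
NSD servers respectively:

# mmlsdisk gpfs1 -m        (shows, per disk, which NSD server the node is currently using for I/O)
# mmdiag --waiters         (on each NSD server, to spot long "I/O completion" waiters as shown further below)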

>> From the waiters you provided I would guess there is something amiss
with some of your storage systems.

Please ensure there are no "disk rebuild" operations pertaining to certain
NSDs/storage volumes in progress (in the storage subsystem), as this can
sometimes impact block-level performance and thus latency, especially for
write operations. Please also ensure that the hardware components constituting
the Spectrum Scale stack are healthy and performing optimally.

https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_pspduetosyslevelcompissue.htm

Please refer to the Spectrum Scale documentation (link below) for potential
causes (e.g. a Scale maintenance operation such as mmapplypolicy/mmrestripefs
in progress, or slow disks) that can be contributing to this issue:

https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_performanceissues.htm

Thanks and Regards,
-Kums

Kumaran Rajaram
Spectrum Scale Development, IBM Systems
k...@us.ibm.com




From:   "Frederick Stock" 
To: gpfsug-discuss@spectrumscale.org
Cc: gpfsug-discuss@spectrumscale.org
Date:   06/04/2020 07:08 AM
Subject:[EXTERNAL] Re: [gpfsug-discuss] Client Latency and High NSD
Server Load Average
Sent by:gpfsug-discuss-boun...@spectrumscale.org



From the waiters you provided I would guess there is something amiss with
some of your storage systems.  Since those waiters are on NSD servers they
are waiting for IO requests to the kernel to complete.  Generally IOs are
expected to complete in milliseconds, not seconds.  You could look at the
output of "mmfsadm dump nsd" to see how the GPFS IO queues are working but
that would be secondary to checking your storage systems.

Fred
__
Fred Stock | IBM Pittsburgh Lab | 720-430-8821
sto...@us.ibm.com


 - Original message -
 From: "Saula, Oluwasijibomi" 
 Sent by: gpfsug-discuss-boun...@spectrumscale.org
 To: "gpfsug-discuss@spectrumscale.org" 
 Cc:
 Subject: [EXTERNAL] Re: [gpfsug-discuss] Client Latency and High NSD
 Server Load Average
 Date: Wed, Jun 3, 2020 6:24 PM

 Frederick,

 Yes on both counts! - mmdf is showing pretty uniform usage (i.e. 5 NSDs out of
 30 report 65% free; all others are uniform at 58% free)...

 NSD servers per disks are called in round-robin fashion as well, for
 example:

  gpfs1 tier2_001nsd02-ib,nsd03-ib,nsd04-ib,tsm01-ib,nsd01-ib
  gpfs1 tier2_002nsd03-ib,nsd04-ib,tsm01-ib,nsd01-ib,nsd02-ib
  gpfs1 tier2_003nsd04-ib,tsm01-ib,nsd01-ib,nsd02-ib,nsd03-ib
  gpfs1 tier2_004tsm01-ib,nsd01-ib,nsd02-ib,nsd03-ib,nsd04-ib


 Any other potential culprits to investigate?

 I do notice nsd03/nsd04 have long waiters, but nsd01 doesn't (nsd02-ib is
 offline for now):
 [nsd03-ib ~]# mmdiag --waiters
 === mmdiag: waiters ===
 Waiting 6.5113 sec since 17:17:33, monitored, thread 4175 NSDThread: for
 I/O completion
 Waiting 6.3810 sec since 17:17:33, monitored, thread 4127 NSDThread: for
 I/O completion
 Waiting 6.1959 sec since 17:17:34, monitored, thread 4144 NSDThread: for
 I/O completion

 nsd04-ib:

 Waiting 13.1386 sec since 17:19:09, monitored, thread 9971 NSDThread: for
 I/O completion
 Waiting 10.3562 sec since 17:19:12, monitored, thread 9958 NSDThread: for
 I/O completion
 Waiting 10.0338 sec since 17:19:12, monitored, thread 9951 NSDThread: for
 I/O completion



 tsm01-ib:

 Waiting 8.1211 sec since 17:20:24, monitored, thread 3644 NSDThread: for
 I/O completion
 Waiting 7.6690 sec since 17:20:24, monitored, thread 3641 NSDThread: for
 I/O completion
 Waiting 7.4969 sec since 17:20:24, monitored, thread 3658 NSDThread: for
 I/O completion
 Waiting 7.3573 sec since 17:20:24, monitored, thread 3642 NSDThread: for
 I/O completion



 nsd01-ib:

 Waiting 0.2548 sec since 17:21:47, monitored, thread 30513 NSDThread: for
 I/O completion
 Waiting 0.1502 sec since 17:21:47, monitored, thread 30529 NSDThread: for
 I/O completion








 Thanks,

 Oluwasijibomi (Siji) Saula


 HPC Systems Administrator  /  Information Technology





 Research 2 Building 220B / Fargo ND 58108-6050


 p: 701.231.7749 / www.ndsu.edu
 From: gpfsug-discuss-boun...@spectrumscale.org
  on behalf of
 gpfsug-discuss-requ...@spectrumscale.org
 
 Sent: Wednesday, June 3, 2

Re: [gpfsug-discuss] How to prove that data is in inode

2019-07-17 Thread Kumaran Rajaram
Hi,

>> How can I prove that data of a small file is stored in the inode (and
not on a data nsd)?

You may use echo "inode file_inode_number" | tsdbfs fs_device | grep
indirectionLevel and if it points to INODE, then the file is stored in the
inodes

# 4K Inode Size
# mmlsfs gpfs3a | grep 'Inode size'
 -i 4096 Inode size in bytes

# Small file
# ls -l /mnt/gpfs3a/hello.txt
-rw-r--r-- 1 root root 6 Jul 17 08:32 /mnt/gpfs3a/hello.txt

# ls -i /mnt/gpfs3a/hello.txt
91649 /mnt/gpfs3a/hello.txt

#File is inlined within Inode
# echo "inode 91649" | tsdbfs gpfs3a | grep indirectionLevel
  indirectionLevel=INODE status=USERFILE

Regards,
-Kums





From:   "Billich  Heinrich Rainer (ID SD)"

To: gpfsug main discussion list 
Date:   07/17/2019 07:49 AM
Subject:[EXTERNAL] [gpfsug-discuss] How to prove that data is in inode
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hello,

How can I prove that data of a small file is stored in the inode (and not
on a data nsd)?

We have a filesystem with 4k inodes on Scale 5.0.2, but it seems there is
no file data in the inodes?

I would expect that 'stat' reports 'Blocks: 0' for a small file, but I see
'Blocks: 1'.

Cheers,

Heiner

I tried

[]# rm -f test; echo hello > test
[]# ls -ls test
1 -rw-r--r-- 1 root root 6 Jul 17 13:11 test
[root@testnas13ems01 test]# stat test
  File: ‘test’
  Size: 6             Blocks: 1          IO Block: 1048576   regular file
Device: 2dh/45d  Inode: 353314  Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2019-07-17 13:11:03.037049000 +0200
Modify: 2019-07-17 13:11:03.037331000 +0200
Change: 2019-07-17 13:11:03.037259319 +0200
 Birth: -
[root@testnas13ems01 test]# du test
1test
[root@testnas13ems01 test]# du -b test
6test
[root@testnas13ems01 test]#

Filesystem

# mmlsfs f
flag                            value                    description
------------------------------- ------------------------ -----------------------------------
 -f                             32768                    Minimum fragment (subblock) size in bytes
 -i                             4096                     Inode size in bytes
 -I                             32768                    Indirect block size in bytes
 -m                             1                        Default number of metadata replicas
 -M                             2                        Maximum number of metadata replicas
 -r                             1                        Default number of data replicas
 -R                             2                        Maximum number of data replicas
 -j                             cluster                  Block allocation type
 -D                             nfs4                     File locking semantics in effect
 -k                             nfs4                     ACL semantics in effect
 -n                             32                       Estimated number of nodes that will mount file system
 -B                             1048576                  Block size
 -Q                             user;group;fileset       Quotas accounting enabled
                                user;group;fileset       Quotas enforced
                                user;group;fileset       Default quotas enabled
 --perfileset-quota             Yes                      Per-fileset quota enforcement
 --filesetdf                    Yes                      Fileset df enabled?
 -V                             20.01 (5.0.2.0)          Current file system version
                                15.01 (4.2.0.0)          Original file system version
 --create-time                  * 2017                   File system creation time
 -z                             No                       Is DMAPI enabled?
 -L                             33554432                 Logfile size
 -E                             Yes                      Exact mtime mount option
 -S                             relatime                 Suppress atime mount option
 -K                             whenpossible             Strict replica allocation option
 --fastea                       Yes                      Fast external attributes enabled?
 --encryption                   No                       Encryption enabled?
 --inode-limit                  1294592                  Maximum number of inodes in all inode spaces
 --log-replicas                 0                        Number of log replicas
 --is4KAligned                  Yes                      is4KAligned?
 --rapid-repair                 Yes                      rapidRepair enabled?
 --write-cache-threshold        0                        HAWC Threshold (max 65536)
 --subblocks-per-full-block     32                       Number of subblocks per full block
 -P                             system;data              Disk storage pools in file system
 --file-audit-log               No                       File Audit Logging enabled?
 --maintenance-mode             No                       Maintenance Mode enabled?
 -d                             **
 -A                             yes                      Automatic mount option
 -o                             nfssync,nodev            Additional mount options
 -T                             /                        Default mount point
 --mount-priority               0                        Mount priority

--
===
Heinrich 

Re: [gpfsug-discuss] verbs status not working in 5.0.2

2019-06-11 Thread Kumaran Rajaram

Hi,

This issue is resolved in the latest 5.0.3.1 release.

# mmfsadm dump version | grep Build
Build branch "5.0.3.1 ".

# mmfsadm test verbs status
VERBS RDMA status: started

Regards,
-Kums





From:   Ryan Novosielski 
To: "gpfsug-discuss@spectrumscale.org"

Date:   06/11/2019 03:46 PM
Subject:[EXTERNAL] Re: [gpfsug-discuss] verbs status not working in
5.0.2
Sent by:gpfsug-discuss-boun...@spectrumscale.org




Thanks -- this was originally how Lenovo told us to check this, and I
came across `mmfsadm test verbs status` on my own.

I'm thinking, though, isn't there some risk that if RDMA went down
somehow, that wouldn't be caught by your script? I can't say that I
normally see that as the failure mode (it's most often booting up
without), nor do I know what happens to `mmfsadm test verbs status` if
you pull a cable or something.

On 6/11/19 3:37 PM, Bryan Banister wrote:
> This has been broken for a long time... we too were checking that
> `mmfsadm test verbs status` reported that RDMA is working.  We
> don't want nodes that are not using RDMA running in the cluster.
>
> We have decided to just look for the log entry like this:
> test_gpfs_rdma_active() { [[ "$(grep -c "VERBS RDMA started"
> /var/adm/ras/mmfs.log.latest)" == "1" ]] }
>
> Hope that helps, -Bryan

- --
 
 || \\UTGERS, |--*O*
 ||_// the State  |Ryan Novosielski - novos...@rutgers.edu
 || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
 ||  \\of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
  `'
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss





Re: [gpfsug-discuss] NSD network checksums (nsdCksumTraditional)

2018-10-29 Thread Kumaran Rajaram
In a non-GNR setup, nsdCksumTraditional=yes enables data-integrity checking 
between a traditional NSD client node and its NSD server, at the network 
level only.

The ESS storage supports end-to-end checksums, from the NSD client to the ESS 
IO servers (at the network level) as well as from the ESS IO servers to the 
disk/storage.  This is further detailed in the docs (link below):

https://www.ibm.com/support/knowledgecenter/en/SSYSP8_5.3.1/com.ibm.spectrum.scale.raid.v5r01.adm.doc/bl1adv_introe2echecksum.htm
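
For reference, a minimal sketch of turning the traditional NSD checksum on
(the node list is a placeholder; given the performance warning discussed in
this thread, verify the impact on a test cluster first):

# mmchconfig nsdCksumTraditional=yes -N nsdclients,nsdservers
# mmlsconfig nsdCksumTraditional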

Best,
-Kums





From:   Stephen Ulmer 
To: gpfsug main discussion list 
Date:   10/29/2018 04:52 PM
Subject:Re: [gpfsug-discuss] NSD network checksums 
(nsdCksumTraditional)
Sent by:gpfsug-discuss-boun...@spectrumscale.org



So the ESS checksums that are highly touted as "protecting all the way to 
the disk surface" completely ignore the transfer between the client and 
the NSD server? It sounds like you are saying that all of the checksumming 
done for GNR is internal to GNR and only protects against bit-flips on the 
disk (and in staging buffers, etc.)

I’m asking because your explanation completely ignores calculating 
anything on the NSD client and implies that the client could not 
participate, given that it does not know about the structure of the vdisks 
under the NSD — but that has to be a performance factor for both types if 
the transfer is protected starting at the client — which it is in the case 
of nsdCksumTraditional which is what we are comparing to ESS checksumming.

If ESS checksumming doesn’t protect on the wire I’d say that marketing has 
run amok, because that has *definitely* been implied in meetings for which 
I’ve been present. In fact, when asked if Spectrum Scale provides 
checksumming for data in-flight, IBM sales has used it as an ESS up-sell 
opportunity.

-- 
Stephen



On Oct 29, 2018, at 3:56 PM, Kumaran Rajaram  wrote:

Hi,

>>How can it be that the I/O performance degradation warning only seems to 
accompany the nsdCksumTraditional setting and not GNR?
>>Why is there such a penalty for "traditional" environments?

In GNR IO/NSD servers (ESS IO nodes), the checksums are computed in 
parallel  for a NSD (storage volume/vdisk) across the threads handling 
each pdisk/drive (that constitutes the vdisk/volume). This is possible 
since the GNR software on the ESS IO servers is tightly integrated with 
underlying storage and is aware of the vdisk DRAID configuration 
(strip-size, pdisk constituting vdisk etc.) to perform parallel checksum 
operations.  

In non-GNR + external storage model, the GPFS software on the NSD 
server(s) does not manage the underlying storage volume (this is done by 
storage RAID controllers)  and the checksum is computed serially. This 
would contribute to increase in CPU usage and I/O performance degradation 
(depending on I/O access patterns, I/O load etc).

My two cents.

Regards,
-Kums





From:Aaron Knister 
To:gpfsug main discussion list 
Date:10/29/2018 12:34 PM
Subject:[gpfsug-discuss] NSD network checksums 
(nsdCksumTraditional)
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Flipping through the slides from the recent SSUG meeting I noticed that 
in 5.0.2 one of the features mentioned was the nsdCksumTraditional flag. 
Reading up on it it seems as though it comes with a warning about 
significant I/O performance degradation and increase in CPU usage. I 
also recall that data integrity checking is performed by default with 
GNR. How can it be that the I/O performance degradation warning only 
seems to accompany the nsdCksumTraditional setting and not GNR? As 
someone who knows exactly 0 of the implementation details, I'm just 
naively assuming that the checksum are being generated (in the same 
way?) in both cases and transferred to the NSD server. Why is there such 
a penalty for "traditional" environments?

-Aaron

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] NSD network checksums (nsdCksumTraditional)

2018-10-29 Thread Kumaran Rajaram
Hi,

>>How can it be that the I/O performance degradation warning only seems to 
accompany the nsdCksumTraditional setting and not GNR?
>>Why is there such a penalty for "traditional" environments?

In GNR IO/NSD servers (ESS IO nodes), the checksums are computed in 
parallel  for a NSD (storage volume/vdisk) across the threads handling 
each pdisk/drive (that constitutes the vdisk/volume). This is possible 
since the GNR software on the ESS IO servers is tightly integrated with 
underlying storage and is aware of the vdisk DRAID configuration 
(strip-size, pdisk constituting vdisk etc.) to perform parallel checksum 
operations. 

In non-GNR + external storage model, the GPFS software on the NSD 
server(s) does not manage the underlying storage volume (this is done by 
storage RAID controllers)  and the checksum is computed serially. This 
would contribute to increase in CPU usage and I/O performance degradation 
(depending on I/O access patterns, I/O load etc).

My two cents.

Regards,
-Kums





From:   Aaron Knister 
To: gpfsug main discussion list 
Date:   10/29/2018 12:34 PM
Subject:[gpfsug-discuss] NSD network checksums 
(nsdCksumTraditional)
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Flipping through the slides from the recent SSUG meeting I noticed that 
in 5.0.2 one of the features mentioned was the nsdCksumTraditional flag. 
Reading up on it it seems as though it comes with a warning about 
significant I/O performance degradation and increase in CPU usage. I 
also recall that data integrity checking is performed by default with 
GNR. How can it be that the I/O performance degradation warning only 
seems to accompany the nsdCksumTraditional setting and not GNR? As 
someone who knows exactly 0 of the implementation details, I'm just 
naively assuming that the checksum are being generated (in the same 
way?) in both cases and transferred to the NSD server. Why is there such 
a penalty for "traditional" environments?

-Aaron

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss







Re: [gpfsug-discuss] Tuning: single client, single thread, small files - native Scale vs NFS

2018-10-15 Thread Kumaran Rajaram
Hi Alexander,

1. >>When writing to GPFS directly I'm able to write ~1800 files / second 
in a test setup. 
>>This is roughly the same on the protocol nodes (NSD client), as well as 
on the ESS IO nodes (NSD server). 

2. >> When writing to the NFS export on the protocol node itself (to avoid 
any network effects) I'm only able to write ~230 files / second.

IMHO, #2 (writing to the NFS export on the protocol node) should perform the 
same as #1. The protocol node is also an NSD client, and when you write from a 
protocol node it uses the NSD protocol to write to the ESS IO nodes. In #1 you 
cite ~1800 files/sec from the protocol node, and in #2 you cite ~230 files/sec, 
which seem to contradict each other. 

>>Writing to the NFS export from another node (now including network 
latency) gives me ~220 files / second.

IMHO, this workload ("single client, single thread, small files, single 
directory - tar xf") is synchronous in nature and will result in a single 
outstanding file being sent from the NFS client to the CES node. Hence, 
performance will be limited by the network latency/capability between the 
NFS client and the CES node for small I/O sizes (~5KB file size). 

Also, what is the network interconnect/interface between the NFS client 
and the CES node - is it 10GigE? Note that 220 files/sec at a ~5KiB file size 
corresponds to only about 1.1 MiB/s (220 files/sec * 5KiB), so the link 
bandwidth is not the limiting factor for this workload; the per-file 
synchronous round-trip latency is. 

>> I'm aware that 'the real thing' would be to work with larger files in a 
multithreaded manner from multiple nodes - and that this scenario will 
scale quite well.

Yes, larger file-size + multiple threads + multiple NFS client nodes will 
help to scale performance further by having more NFS I/O requests 
scheduled/pipelined over the network and  processed on the  CES nodes. 

>> I just want to ensure that I'm not missing something obvious over 
reiterating that message to customers.

Adding NFS experts/team, for advise. 

My two cents.

Best Regards,
-Kums





From:   "Alexander Saupp" 
To: gpfsug-discuss@spectrumscale.org
Date:   10/15/2018 02:20 PM
Subject:[gpfsug-discuss] Tuning: single client, single thread, 
small files - native Scale vs NFS
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Dear Spectrum Scale mailing list,

I'm part of IBM Lab Services - currently i'm having multiple customers 
asking me for optimization of a similar workloads.

The task is to tune a Spectrum Scale system (comprising ESS and CES 
protocol nodes) for the following workload: 
A single Linux NFS client mounts an NFS export, extracts a flat tar 
archive with lots of ~5KB files. 
I'm measuring the speed at which those 5KB files are written (`time tar xf 
archive.tar`). 

I do understand that Spectrum Scale is not designed for such a workload 
(single client, single thread, small files, single directory), and that such 
a benchmark is not appropriate for benchmarking the system. 
Yet I find myself explaining the performance for such scenarios (git 
clone..) quite frequently, as customers insist that optimization of that 
scenario would impact individual users as it shows up in task duration.
I want to make sure that I have optimized the system as much as possible 
for the given workload, and that I have not overlooked something obvious.


When writing to GPFS directly I'm able to write ~1800 files / second in a 
test setup. 
This is roughly the same on the protocol nodes (NSD client), as well as on 
the ESS IO nodes (NSD server). 
When writing to the NFS export on the protocol node itself (to avoid any 
network effects) I'm only able to write ~230 files / second.
Writing to the NFS export from another node (now including network 
latency) gives me ~220 files / second.


There seems to be a huge performance degradation by adding NFS-Ganesha to 
the software stack alone. I wonder what can be done to minimize the 
impact.


- Ganesha doesn't seem to support 'async' or 'no_wdelay' options... 
anything equivalent available?
- Is there and expected advantage of using the network-latency tuned 
profile, as opposed to the ESS default throughput-performance profile?
- Are there other relevant Kernel params?
- Is there an expected advantage of raising the number of threads (NSD 
server (nsd*WorkerThreads) / NSD client (workerThreads) / Ganesha 
(NB_WORKER)) for the given workload (single client, single thread, small 
files)?
- Are there other relevant GPFS params?
- Impact of Sync replication, disk latency, etc is understood. 
- I'm aware that 'the real thing' would be to work with larger files in a 
multithreaded manner from multiple nodes - and that this scenario will 
scale quite well.
I just want to ensure that I'm not missing something obvious over 
reiterating that message to customers.

Any help was greatly appreciated - thanks much in advance!
Alexander Saupp
IBM Germany


Mit freundlichen Grüßen / Kind regards

Alexander Saupp

IBM Systems, Storage Platform, EMEA Storage Competence Center


Phone:
+49 7034-643-1512
IBM Deutschland GmbH

Re: [gpfsug-discuss] What NSDs does a file have blocks on?

2018-07-09 Thread Kumaran Rajaram
Hi Kevin,

>>I want to know what NSDs a single file has its’ blocks on?

You may use /usr/lpp/mmfs/samples/fpo/mmgetlocation to obtain the 
file-to-NSD block layout map. Use the -h option for this tool's usage (
mmgetlocation -h). 

Sample output is below:

# File-system block size is 4MiB and sample file is 40MiB.
# ls -lh /mnt/gpfs3a/data_out/lf
-rw-r--r-- 1 root root 40M Jul  9 16:42 /mnt/gpfs3a/data_out/lf
# du -sh /mnt/gpfs3a/data_out/lf
40M /mnt/gpfs3a/data_out/lf
# mmlsfs gpfs3a | grep 'Block size'
 -B 4194304  Block size

# The file data is striped across 10 x NSDs (DMD_NSDX) constituting the 
file-system

# /usr/lpp/mmfs/samples/fpo/mmgetlocation -f /mnt/gpfs3a/data_out/lf
[FILE /mnt/gpfs3a/data_out/lf INFORMATION]
 FS_DATA_BLOCKSIZE : 4194304 (bytes)
 FS_META_DATA_BLOCKSIZE : 4194304 (bytes)
 FS_FILE_DATAREPLICA : 1
 FS_FILE_METADATAREPLICA : 1
 FS_FILE_STORAGEPOOLNAME : system
 FS_FILE_ALLOWWRITEAFFINITY : no
 FS_FILE_WRITEAFFINITYDEPTH : 0
 FS_FILE_BLOCKGROUPFACTOR : 1

chunk(s)# 0 (offset 0) : [DMD_NSD5 c72f1m5u37ib0,c72f1m5u39ib0]
chunk(s)# 1 (offset 4194304) : [DMD_NSD6 c72f1m5u39ib0,c72f1m5u37ib0]
chunk(s)# 2 (offset 8388608) : [DMD_NSD7 c72f1m5u37ib0,c72f1m5u39ib0]
chunk(s)# 3 (offset 12582912) : [DMD_NSD8 c72f1m5u39ib0,c72f1m5u37ib0]
chunk(s)# 4 (offset 16777216) : [DMD_NSD9 c72f1m5u37ib0,c72f1m5u39ib0]
chunk(s)# 5 (offset 20971520) : [DMD_NSD10 c72f1m5u39ib0,c72f1m5u37ib0]
chunk(s)# 6 (offset 25165824) : [DMD_NSD1 c72f1m5u37ib0,c72f1m5u39ib0]
chunk(s)# 7 (offset 29360128) : [DMD_NSD2 c72f1m5u39ib0,c72f1m5u37ib0]
chunk(s)# 8 (offset 33554432) : [DMD_NSD3 c72f1m5u37ib0,c72f1m5u39ib0]
chunk(s)# 9 (offset 37748736) : [DMD_NSD4 c72f1m5u39ib0,c72f1m5u37ib0]


[FILE: /mnt/gpfs3a/data_out/lf SUMMARY INFO]
replica1:
c72f1m5u37ib0,c72f1m5u39ib0: 5 chunk(s)
c72f1m5u39ib0,c72f1m5u37ib0: 5 chunk(s)

Thanks and Regards,
-Kums






From:   "Buterbaugh, Kevin L" 
To: gpfsug main discussion list 
Date:   07/09/2018 04:05 PM
Subject:[gpfsug-discuss] What NSDs does a file have blocks on?
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi All, 

I am still working on my issue of the occasional high I/O wait times and 
that has raised another question … I know that I can run mmfileid to see 
what files have a block on a given NSD, but is there a way to do the 
opposite?  I.e. I want to know what NSDs a single file has its blocks on? 
 The mmlsattr command does not appear to show this information unless it's 
got an undocumented option.  Thanks…

Kevin

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and 
Education
kevin.buterba...@vanderbilt.edu - (615)875-9633


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Lroc on NVME

2018-06-12 Thread Kumaran Rajaram
Hi,
 
>>Yes, older versions of GPFS don't recognize /dev/nvme*. So you would need /var/mmfs/etc/nsddevices user exit. >>On newer GPFS versions, the nvme devices are also generic
 but has anyone else tried to get lroc running on nvme and how well does it work.
 
IMHO, the support to recognize /dev/nvme* was added to Spectrum Scale version 5.0.1.
 
The Spectrum Scale version 5.0.1 has LROC performance enhancements compared to the earlier versions, for file stat/read performance from LROC devices.
 
The following provides sample performance data using a single mdtest MPI process with LROC data on an Intel SSDPEDMD016T4L, for Spectrum Scale version 5.0.0 vs. 5.0.1.

Benchmark arguments:
mpiexec -f $MACH_FILE -n $MAX_NP $BENCHMARK -i 1 -n $n_files -u -F -T -E -e $file_sz -d $O_DIR

Sample:
mpiexec -f /mnt/sw_x86/mpich/mf.perf_x86.c72f1m5u27 -n 1 /mnt/sw_x86/benchmarks/mdtest/mdtest -i 1 -n 65536 -u -F -T -E -e '1024' -d /mnt/gpfs3a/lroc_mdtest_out/uniq_dir_1024_65536

File metadata ops/s (mdtest), 5.0.0 vs. 5.0.1:

                              5.0.0                  5.0.1                  Performance Delta (%)
File Count  File Size (KiB)   File Stat   File Read  File Stat   File Read  File Stat   File Read
16384       1                 4170        3979       5182        4765       24.29       19.77
16384       32                3938        3585       5220        4609       32.57       28.56
65536       1                 1511        1319       2122        1893       40.46       43.50
65536       32                1418        661        2214        803        56.18       21.52
 
Best Regards,
-Kums
 
- Original message -
From: "Truong Vu"
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: gpfsug-discuss@spectrumscale.org
Cc:
Subject: Re: [gpfsug-discuss] Lroc on NVME
Date: Tue, Jun 12, 2018 9:55 AM

Yes, older versions of GPFS don't recognize /dev/nvme*. So you would need the
/var/mmfs/etc/nsddevices user exit. On newer GPFS versions, the nvme devices
are also generic. So, it is good that you are using the same NSD sub-type.

Cheers,
Tru.
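
For completeness, on releases that do not recognize /dev/nvme* natively, the
user exit mentioned above can be a small script modeled on
/usr/lpp/mmfs/samples/nsddevices.sample. The sketch below is an example only
(the device glob, the "generic" device type, and the return-value semantics
should be checked against the shipped sample for your release):

# cat /var/mmfs/etc/nsddevices
#!/bin/ksh
# Emit "<device name relative to /dev> <device type>" for each NVMe namespace
for dev in /dev/nvme*n1 ; do
  [ -b "$dev" ] || continue
  echo "${dev#/dev/} generic"
done
# Intended to let GPFS continue with its built-in device discovery as well
return 0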

Re: [gpfsug-discuss] GPFS 4.2.3.4 question

2017-08-27 Thread Kumaran Rajaram
Hi Kevin,

>> Thanks - important followup question … does 4.2.3.4 contain the fix for 
the mmrestripefs data loss bug that was announced last week?  Thanks 
again…

I presume, by "mmrestripefs data loss bug" you are referring to APAR 
IV98609 (link below)? If yes, 4.2.3.4 contains the fix for APAR IV98609.

http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487

Problems fixed in GPFS 4.2.3.4 (details in link below):

https://www.ibm.com/developerworks/community/forums/html/topic?id=f3705faa-b6aa-415c-a3e6-1fe9d8293db1=25

* This update addresses the following APARs: IV98545 IV98609 IV98640 
IV98641 IV98643 IV98683 IV98684 IV98685 IV98686 IV98687 IV98701 IV99044 
IV99059 IV99060 IV99062 IV99063. 

Regards,
-Kums




From:   "Buterbaugh, Kevin L" 
To: gpfsug main discussion list 
Date:   08/27/2017 09:32 AM
Subject:Re: [gpfsug-discuss] GPFS 4.2.3.4 question
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Fred / All, 

Thanks - important followup question … does 4.2.3.4 contain the fix for 
the mmrestripefs data loss bug that was announced last week?  Thanks 
again…

Kevin

On Aug 26, 2017, at 7:35 PM, Frederick Stock  wrote:

The only change missing is the change delivered  in 4.2.3 PTF3 efix3 which 
was provided on August 22.  The problem had to do with NSD deletion and 
creation.

Fred
__
Fred Stock | IBM Pittsburgh Lab | 720-430-8821
sto...@us.ibm.com



From:"Buterbaugh, Kevin L" 
To:gpfsug main discussion list 
Date:08/26/2017 03:40 PM
Subject:[gpfsug-discuss] GPFS 4.2.3.4 question
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi All, 

Does anybody know if GPFS 4.2.3.4, which came out today, contains all the 
patches that are in GPFS 4.2.3.3 efix3?

If anybody does, and can respond, I’d greatly appreciate it.  Our cluster 
is in a very, very bad state right now and we may need to just take it 
down and bring it back up.  I was already planning on rolling out GPFS 
4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 
4.2.3.4 that would be great…

Thanks!

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and 
Education
kevin.buterba...@vanderbilt.edu- (615)875-9633


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
 




___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing

2017-08-25 Thread Kumaran Rajaram
Hi,

>>I was wondering if there are any good performance sizing guides for a 
spectrum scale shared nothing architecture (FPO)?
>> I don't have any production experience using spectrum scale in a 
"shared nothing configuration " and was hoping for bandwidth / throughput 
sizing guidance. 

Please ensure that all the recommended FPO settings (e.g. 
allowWriteAffinity=yes in the FPO storage pool, readReplicaPolicy=local, 
restripeOnDiskFailure=yes) are set properly. The FPO best practices/tunings 
can be found in the links below: 
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Big%20Data%20Best%20practices
https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/ab5c2792-feef-4a3a-a21b-d22c6f5d728a/attachment/80d5c300-7b39-4d6e-9596-84934fcc4638/media/Deploying_a_big_data_solution_using_IBM_Spectrum_Scale_v1.7.5.pdf
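
As a rough sketch only (the pool name and the values are assumptions; the
best-practice documents above are authoritative), the FPO-related settings
typically look something like:

# mmchconfig readReplicaPolicy=local,restripeOnDiskFailure=yes

and, in the pool stanza used when creating the FPO data pool:

%pool: pool=fpodata blockSize=2M layoutMap=cluster allowWriteAffinity=yes writeAffinityDepth=1 blockGroupFactor=128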

>> For example, each node might consist of 24x storage drives (locally 
attached JBOD, no RAID array).
>> Given a particular node configuration I want to be in a position to 
calculate the maximum bandwidth / throughput.

With FPO, GPFS metadata (-m) and data replication (-r) need to be 
enabled.  The write-affinity-depth (WAD) setting defines the policy for 
directing writes. It indicates that the node writing the data directs the 
write to disks on its own node for the first copy and to the disks on 
other nodes for the second and third copies (if specified). 
readReplicaPolicy=local enables the policy to read replicas from local 
disks.

At the minimum, ensure that the networking used for GPFS is sized properly 
and has bandwidth 2X or 3X that of the local disk speeds to ensure FPO 
write bandwidth is not being constrained by GPFS replication over the 
network. 

For example, if 24 x Drives in RAID-0 results in ~4.8 GB/s (assuming 
~200MB/s per drive) and GPFS metadata/data replication is set to 3 (-m 3 
-r 3) then for optimal FPO write bandwidth, we need to ensure the 
network-interconnect between the FPO nodes is non-blocking/high-speed and 
can sustain ~14.4 GB/s (data_replication_factor * local_storage_bandwidth). 
One possibility is a minimum of 2 x EDR InfiniBand links (configure GPFS 
verbsRdma/verbsPorts) or bonded 40GigE between the FPO nodes (for GPFS 
daemon-to-daemon communication). Application reads requiring FPO reads from 
a remote GPFS node would likewise benefit from a high-speed network 
interconnect between the FPO nodes. 

Regards,
-Kums





From:   Evan Koutsandreou 
To: "gpfsug-discuss@spectrumscale.org" 

Date:   08/20/2017 11:06 PM
Subject:[gpfsug-discuss] Shared nothing (FPO) throughout / 
bandwidth sizing
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi -

I was wondering if there are any good performance sizing guides for a 
spectrum scale shared nothing architecture (FPO)?

For example, each node might consist of 24x storage drives (locally 
attached JBOD, no RAID array).

I don't have any production experience using spectrum scale in a "shared 
nothing configuration " and was hoping for bandwidth / throughput sizing 
guidance. 

Given a particular node configuration I want to be in a position to 
calculate the maximum bandwidth / throughput.

Thank you 
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Baseline testing GPFS with gpfsperf

2017-07-26 Thread Kumaran Rajaram
Hi Scott,

>>- Should the number of threads equal the number of NSDs for the file 
system? or equal to the number of nodes? 
>>- If I execute a large multi-threaded run of this tool from a single 
node in the cluster, will that give me an accurate result of the 
performance of the file system?  

To add to Valdis's note,  the answer to above also depends on the node, 
network used for GPFS communication between client and server, as well as 
storage performance capabilities constituting the GPFS 
cluster/network/storage stack. 

As an example, suppose the storage subsystem (including controller + disks) 
hosting the file-system can deliver ~20 GB/s and the networking between 
NSD client and server is FDR 56Gb/s InfiniBand (verbsRdma = ~6GB/s). 
Assuming one FDR-IB link (verbsPorts) is configured per NSD server as 
well as per client, you would need a minimum of 4 x NSD servers (4 x 6GB/s 
==> 24 GB/s) to saturate the backend storage.  So, you would need to run 
gpfsperf (or any other parallel I/O benchmark) across a minimum of 4 x GPFS 
NSD clients to saturate the backend storage.  You can scale the gpfsperf 
thread count (-th parameter) depending on access pattern (buffered/dio 
etc.), but this will only drive load from a single NSD client node. If you 
would like to drive I/O load from multiple NSD client nodes and synchronize 
the parallel runs across nodes for accuracy, then gpfsperf-mpi is strongly 
recommended. You would need to use MPI to launch gpfsperf-mpi across 
multiple NSD client nodes and scale the MPI processes (one or more MPI 
processes per NSD client) accordingly to drive the I/O load for good 
performance. 
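
For illustration (the paths, sizes and thread counts are placeholders), a
sequential run with the sample tool built from /usr/lpp/mmfs/samples/perf
might look like the following, started concurrently on each of the (at least
4) NSD clients and the per-node results then aggregated:

# gpfsperf create seq /gpfs/fs1/perf/testfile.$(hostname) -n 200g -r 8m -th 16
# gpfsperf read seq /gpfs/fs1/perf/testfile.$(hostname) -n 200g -r 8m -th 16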

>>The cluster that I will be running this tool on will not have MPI 
installed and will have multiple file systems in the cluster. 

Without MPI, an alternative would be to use ssh or pdsh to launch gpfsperf 
across multiple nodes; however, if there are slow NSD clients the results may 
not be accurate (slow clients take longer, and once the faster clients have 
finished they get all of the network/storage resources, skewing the 
performance analysis). You may also consider using parallel IOzone, as it can 
be run across multiple nodes using rsh/ssh with a combination of the "-+m" 
and "-t" options. 

http://iozone.org/docs/IOzone_msword_98.pdf

##
-+m filename
Use this file to obtain the configuration information of the clients for
cluster testing. The file contains one line for each client. Each line has
three fields. The fields are space delimited. A # sign in column zero is a
comment line. The first field is the name of the client. The second field is
the path, on the client, for the working directory where Iozone will execute.
The third field is the path, on the client, for the executable Iozone.
To use this option one must be able to execute commands on the clients without
being challenged for a password. Iozone will start remote execution by using
"rsh".

To use ssh, export RSH=/usr/bin/ssh

-t #
Run Iozone in a throughput mode. This option allows the user to specify how
many threads or processes to have active during the measurement.
##
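
Putting the two options together, a hedged example of a clustered IOzone run
(the client list, paths, and sizes are placeholders) could be:

# cat clients.cfg
node01 /gpfs/fs1/ioz /usr/local/bin/iozone
node02 /gpfs/fs1/ioz /usr/local/bin/iozone
node03 /gpfs/fs1/ioz /usr/local/bin/iozone
node04 /gpfs/fs1/ioz /usr/local/bin/iozone

# export RSH=/usr/bin/ssh
# iozone -+m clients.cfg -t 4 -i 0 -i 1 -r 4m -s 64g

(-t 4 drives one stream per listed client, -i 0/-i 1 select the sequential
write and read tests, -r and -s set the record and file sizes.)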

Hope this helps,
-Kums





From:   valdis.kletni...@vt.edu
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Date:   07/25/2017 07:59 PM
Subject:Re: [gpfsug-discuss] Baseline testing GPFS with gpfsperf
Sent by:gpfsug-discuss-boun...@spectrumscale.org



On Tue, 25 Jul 2017 15:46:45 -0500, "Scott C Batchelder" said:

> - Should the number of threads equal the number of NSDs for the file
> system? or equal to the number of nodes?

Depends on what definition of "throughput" you are interested in. If your
configuration has 50 clients banging on 5 NSD servers, your numbers for 5
threads and 50 threads are going to tell you subtly different things...

(Basically, one thread per NSD is going to tell you the maximum that
one client can expect to get with little to no contention, while one
per client will tell you about the maximum *aggregate* that all 50
can get together - which is probably still giving each individual client
less throughput than one-to-one)

We usually test with "exactly one thread total", "one thread per server",
and "keep piling the clients on till the total number doesn't get any 
bigger".

Also be aware that it only gives you insight to your workload performance 
if
your workload is comprised of large file access - if your users are 
actually
doing a lot of medium or small files, that changes the results 
dramatically
as you end up possibly pounding on metadata more than the actual data
[attachment "att0twxd.dat" deleted by Kumaran Rajaram/Arlington/IBM] 
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] get free space in GSS

2017-07-09 Thread Kumaran Rajaram
Hi Atmane,

>> I can not find the free space

Based on your output below, your setup currently has two recovery groups 
BB1RGL and BB1RGR.

Issue "mmlsrecoverygroup BB1RGL -L" and "mmlsrecoverygroup BB1RGR -L" to 
obtain free space in each DA.

Based on your "mmlsrecoverygroup BB1RGL -L" output below, BB1RGL "DA1" has 
12GiB and "DA2" has 4GiB free space. The metadataOnly and dataOnly 
vdisk/NSD are created from DA1 and DA2. 

 declustered  needs                             replace                scrub     background activity
    array     service  vdisks  pdisks  spares  threshold  free space  duration   task   progress  priority
 -----------  -------  ------  ------  ------  ---------  ----------  ---------  -------------------------
  LOG         no            1       3     0,0          1     558 GiB    14 days  scrub     51%    low
  DA1         no           11      58    2,31          2      12 GiB    14 days  scrub     78%    low
  DA2         no            6      58    2,31          2    4096 MiB    14 days  scrub     10%    low

In addition, you may use "mmlsnsd" to obtain the mapping of file systems to 
vdisks/NSDs, and use the "mmdf <fs-device>" command to query used and 
available capacity in a GPFS file system.
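
For example (the file system device name is a placeholder):

# mmlsnsd -f gpfs1                     (NSD-to-server mapping for that file system)
# mmdf gpfs1 --block-size auto         (per-NSD / per-pool used and free capacity)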

Hope this helps,
-Kums





From:   atmane khiredine 
To: Laurence Horrocks-Barlow , "gpfsug main 
discussion list" 
Date:   07/09/2017 08:27 AM
Subject:Re: [gpfsug-discuss] get free space in GSS
Sent by:gpfsug-discuss-boun...@spectrumscale.org



thank you very much for replying. I can not find the free space

Here is the output of mmlsrecoverygroup

[root@server1 ~]#mmlsrecoverygroup

                 declustered
                 arrays with
 recovery group  vdisks       vdisks  servers
 --------------  -----------  ------  ------------------
 BB1RGL          3            18      server1,server2
 BB1RGR          3            18      server2,server1
--
[root@server ~]# mmlsrecoverygroup BB1RGL -L

                  declustered
 recovery group   arrays       vdisks  pdisks  format version
 ---------------  -----------  ------  ------  --------------
 BB1RGL           3            18      119     4.2.0.1

 declustered  needs                             replace                scrub     background activity
    array     service  vdisks  pdisks  spares  threshold  free space  duration   task   progress  priority
 -----------  -------  ------  ------  ------  ---------  ----------  ---------  -------------------------
  LOG         no            1       3     0,0          1     558 GiB    14 days  scrub     51%    low
  DA1         no           11      58    2,31          2      12 GiB    14 days  scrub     78%    low
  DA2         no            6      58    2,31          2    4096 MiB    14 days  scrub     10%    low

                                      declustered                           checksum
 vdisk               RAID code        array        vdisk size  block size   granularity  state  remarks
 ------------------  ---------------  -----------  ----------  -----------  -----------  -----  -------
 gss0_logtip         3WayReplication  LOG          128 MiB     1 MiB        512          ok     logTip
 gss0_loghome        4WayReplication  DA1          40 GiB      1 MiB        512          ok     log
 BB1RGL_GPFS4_META1  4WayReplication  DA1          451 GiB     1 MiB        32 KiB       ok
 BB1RGL_GPFS4_DATA1  8+2p             DA1          5133 GiB    1 MiB        32 KiB       ok
 BB1RGL_GPFS1_META1  4WayReplication  DA1          451 GiB     1 MiB        32 KiB       ok
 BB1RGL_GPFS1_DATA1  8+2p             DA1          12 TiB      1 MiB        32 KiB       ok
 BB1RGL_GPFS3_META1  4WayReplication  DA1          451 GiB     1 MiB        32 KiB       ok
 BB1RGL_GPFS3_DATA1  8+2p             DA1          12 TiB      1 MiB        32 KiB       ok
 BB1RGL_GPFS2_META1  4WayReplication  DA1          451 GiB     1 MiB        32 KiB       ok
 BB1RGL_GPFS2_DATA1  8+2p             DA1          13 TiB      2 MiB        32 KiB       ok
 BB1RGL_GPFS2_META2  4WayReplication  DA2          451 GiB     1 MiB        32 KiB       ok
 BB1RGL_GPFS2_DATA2  8+2p             DA2          13 TiB      2 MiB        32 KiB       ok
 BB1RGL_GPFS1_META2  4WayReplication  DA2          451 GiB     1 MiB        32 KiB       ok
 BB1RGL_GPFS1_DATA2  8+2p             DA2          12 TiB      1 MiB        32 KiB       ok
 BB1RGL_GPFS5_META1  4WayReplication  DA1          750 GiB     1 MiB        32 KiB       ok
 BB1RGL_GPFS5_DATA1  8+2p             DA1          70 TiB      16 MiB       32 KiB       ok
 BB1RGL_GPFS5_META2  4WayReplication  DA2          750 GiB     1 MiB        32 KiB       ok
 BB1RGL_GPFS5_DATA2  8+2p             DA2          90 TiB      16 MiB       32 KiB       ok

 config data declustered array   VCD spares actual rebuild 

Re: [gpfsug-discuss] IO prioritisation / throttling?

2017-06-23 Thread Kumaran Rajaram
Hi John,

>> We have a GPFS Setup using Fujitsu filers and Mellanox infiniband.
>> The desire is to set up an environment for test and development where, if IO
>> ‘runs wild’, it will not bring down the production storage.

You may use the Spectrum Scale Quality of Service for I/O command "mmchqos" 
(details in the links below) to define IOPS limits for the "other" as well as 
the "maintenance" class on the Dev/Test file-system pools (e.g., mmchqos 
tds_fs --enable pool=*,other=1IOPS,maintenance=5000IOPS).  This way, the Test 
and Dev file-system/storage-pool IOPS can be limited/controlled to the 
specified values, giving higher priority to the production GPFS 
file-system/storage (with production_fs pool=*,other=unlimited,
maintenance=unlimited, which is the default). 

https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_mmchqos.htm
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_qosio_describe.htm#qosio_describe
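
A hedged sketch (the file-system names and IOPS values are placeholders) of
throttling the test/dev file system while leaving production unlimited, and
then verifying the result:

# mmchqos tds_fs --enable pool=*,other=10000IOPS,maintenance=5000IOPS
# mmchqos prod_fs --enable pool=*,other=unlimited,maintenance=unlimited
# mmlsqos tds_fs --seconds 60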

My two cents.

Regards,
-Kums





From:   John Hearns 
To: gpfsug main discussion list 
Date:   06/23/2017 04:14 AM
Subject:[gpfsug-discuss] IO prioritisation / throttling?
Sent by:gpfsug-discuss-boun...@spectrumscale.org



I guess this is a rather ill-defined question, and I realise it will be 
open to a lot of interpretations.
We have a GPFS Setup using Fujitsu filers and Mellanox infiniband.
The desire is to set up an environment for test and development where, if IO 
‘runs wild’, it will not bring down the production storage. If anyone has a 
setup like this I would be interested in chatting with you.
Is it feasible to create filesets which have higher/lower priority than 
others?
 
Thankyou for any insights or feedback
John Hearns
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss





Re: [gpfsug-discuss] 4.2.3.x and sub-block size

2017-06-14 Thread Kumaran Rajaram
Hi,

>>Back at SC16 I was told that GPFS 4.2.3.x would remove the “a sub-block 
is 1/32nd of the block size” restriction.  However, I have installed GPFS 
4.2.3.1 on my test cluster and in the man page for mmcrfs I still see:
>>So has the restriction been removed?  If not, is there an update on 
which version of GPFS will remove it?  If so, can the documentation be 
updated to reflect the change and how to take advantage of it?  Thanks…

Based on the current plan, this “a sub-block is 1/32nd of the block size” 
restriction will be removed in the upcoming GPFS version 4.2.4 (please 
NOTE: support for >32 sub-blocks per block may be subject to delay based 
on internal qualification/validation efforts).
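
As a quick way to see the current behaviour on an existing file system, the 
minimum fragment (sub-block) size reported by mmlsfs should be 1/32nd of the 
block size. A minimal sketch, assuming a file system named "gpfs1" with a 
1 MiB block size (the name and the sample output are illustrative only):

# Block size and minimum fragment (sub-block) size
mmlsfs gpfs1 -B -f
#  -B   1048576   Block size
#  -f   32768     Minimum fragment size in bytes   (1048576 / 32 = 32768)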

Regards,
-Kums





From:   "Buterbaugh, Kevin L" 
To: gpfsug main discussion list 
Date:   06/14/2017 12:12 PM
Subject:[gpfsug-discuss] 4.2.3.x and sub-block size
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi All, 

Back at SC16 I was told that GPFS 4.2.3.x would remove the “a sub-block is 
1/32nd of the block size” restriction.  However, I have installed GPFS 
4.2.3.1 on my test cluster and in the man page for mmcrfs I still see:

2. The GPFS block size determines:

   *  The minimum disk space allocation unit. The minimum amount
  of space that file data can occupy is a sub‐block. A
  sub‐block is 1/32 of the block size.

So has the restriction been removed?  If not, is there an update on which 
version of GPFS will remove it?  If so, can the documentation be updated 
to reflect the change and how to take advantage of it?  Thanks…

Kevin

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and 
Education
kevin.buterba...@vanderbilt.edu - (615)875-9633


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss





Re: [gpfsug-discuss] Well, this is the pits...

2017-05-04 Thread Kumaran Rajaram
>>Thanks for the info on the releases … can you clarify about 
pitWorkerThreadsPerNode? 

pitWorkerThreadsPerNode -- Specifies how many threads per node perform 
restripe, data movement, and similar operations.

>>As I said in my original post, on all 8 NSD servers and the filesystem 
manager it is set to zero.  No matter how many times I add zero to zero I 
don’t get a value > 31!  ;-)  So I take it that zero has some sort of 
unspecified significance?  Thanks…

A value of 0 simply means that pitWorkerThreadsPerNode takes an internally 
computed value (16 or lower), based on the GPFS setup and file-system 
configuration, according to the following formula:

Default: pitWorkerThreadsPerNode = MIN(16, (numberOfDisks_in_filesystem 
* 4) / numberOfParticipatingNodes_in_mmrestripefs + 1)

For example, if you have 64 x NSDs in your file-system and you are using 8 
NSD servers in "mmrestripefs -N", then

pitWorkerThreadsPerNode = MIN(16, (64 * 4 / 8) + 1) = MIN(16, 33), so 
pitWorkerThreadsPerNode takes the value 16 (i.e. the default of 0 results 
in 16 threads doing restripe per participating mmrestripefs node).

If you want 8 NSD servers (running 4.2.2.3) to participate in the mmrestripefs 
operation, then set "mmchconfig pitWorkerThreadsPerNode=3 -N 
<8_NSD_Servers>" so that the sum (8 x 3 = 24) stays within the limit of 31.
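
For illustration, a minimal shell sketch of this sizing arithmetic (the NSD 
and node counts are example assumptions; GPFS computes the real value 
internally, this is only to reason about the 31-thread limit on 4.2.2.x):

nsds=64                              # NSDs in the file system (example value)
nodes=8                              # nodes passed to "mmrestripefs -N" (example value)
threads=$(( nsds * 4 / nodes + 1 ))
[ "$threads" -gt 16 ] && threads=16  # the computed default is capped at 16 per node
echo "default pitWorkerThreadsPerNode : $threads"
echo "sum across participating nodes  : $(( threads * nodes ))"   # must be <= 31 on 4.2.2.x

In this example the sum would be 128, which is why pitWorkerThreadsPerNode 
has to be lowered explicitly (e.g. to 3) before the restripe.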

Regards,
-Kums





From:   "Buterbaugh, Kevin L" <kevin.buterba...@vanderbilt.edu>
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Date:   05/04/2017 12:57 PM
Subject:Re: [gpfsug-discuss] Well, this is the pits...
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi Kums, 

Thanks for the info on the releases … can you clarify about 
pitWorkerThreadsPerNode?  As I said in my original post, on all 8 NSD 
servers and the filesystem manager it is set to zero.  No matter how many 
times I add zero to zero I don’t get a value > 31!  ;-)  So I take it that 
zero has some sort of unspecified significance?  Thanks…

Kevin

On May 4, 2017, at 11:49 AM, Kumaran Rajaram <k...@us.ibm.com> wrote:

Hi,

>>I’m running 4.2.2.3 on my GPFS servers (some clients are on 4.2.1.1 or 
4.2.0.3 and are gradually being upgraded).  What version of GPFS fixes 
this?  With what I’m doing I need the ability to run mmrestripefs.

GPFS version 4.2.3.0 (and above) fixes this issue and allows the sum of 
pitWorkerThreadsPerNode of the participating nodes (-N parameter to 
mmrestripefs) to exceed 31.

If you are using 4.2.2.3, then depending on the number of nodes participating 
in the mmrestripefs, the GPFS config parameter 
"pitWorkerThreadsPerNode" needs to be adjusted such that the sum of 
pitWorkerThreadsPerNode of the participating nodes is <= 31.

For example, if the number of nodes participating in the mmrestripefs is 6, 
then adjust "mmchconfig pitWorkerThreadsPerNode=5 -N 
". GPFS would need to be restarted for this parameter 
to take effect on the participating nodes (verify with mmfsadm dump 
config | grep pitWorkerThreadsPerNode)

Regards,
-Kums





From:"Buterbaugh, Kevin L" <kevin.buterba...@vanderbilt.edu>
To:gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Date:05/04/2017 12:08 PM
Subject:Re: [gpfsug-discuss] Well, this is the pits...
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi Olaf, 

I didn’t touch pitWorkerThreadsPerNode … it was already zero.

I’m running 4.2.2.3 on my GPFS servers (some clients are on 4.2.1.1 or 
4.2.0.3 and are gradually being upgraded).  What version of GPFS fixes 
this?  With what I’m doing I need the ability to run mmrestripefs.

It seems to me that mmrestripefs could check whether QOS is enabled … 
granted, it would have no way of knowing whether the values used actually 
are reasonable or not … but if QOS is enabled then “trust” it to not 
overrun the system.

PMR time?  Thanks..

Kevin

On May 4, 2017, at 10:54 AM, Olaf Weiser <olaf.wei...@de.ibm.com> wrote:

Hi Kevin, 
the number of NSDs is more or less nonsense .. it is just that the number of 
nodes x PITWorker should not exceed the #mutex/FS block by too much
did you adjust/tune the PitWorker ? ... 

as far as I know.. the fact that the code checks the number of NSDs is already 
considered a defect and will be fixed / is already fixed (I stepped 
into it here as well) 

ps. QOS is the better approach to address this, but unfortunately.. not 
everyone is using it by default... that's why I suspect the development 
team decided to put in a check/limit here .. which in your case (with QOS) 
wouldn't be needed 





From:"Buterbaugh, Kevin L" <kevin.buterba...@vanderbilt.edu>
To:gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Date:05/04/2017 05:44 PM
Subject:Re: [gpfsug-discuss] Well, this is the pits...
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi Olaf, 

Your explanation most

Re: [gpfsug-discuss] Well, this is the pits...

2017-05-04 Thread Kumaran Rajaram
Hi,

>>I’m running 4.2.2.3 on my GPFS servers (some clients are on 4.2.1.1 or 
4.2.0.3 and are gradually being upgraded).  What version of GPFS fixes 
this?  With what I’m doing I need the ability to run mmrestripefs.

GPFS version 4.2.3.0 (and above) fixes this issue and allows the sum of 
pitWorkerThreadsPerNode of the participating nodes (-N parameter to 
mmrestripefs) to exceed 31.

If you are using 4.2.2.3, then depending on the number of nodes participating 
in the mmrestripefs, the GPFS config parameter 
"pitWorkerThreadsPerNode" needs to be adjusted such that the sum of 
pitWorkerThreadsPerNode of the participating nodes is <= 31.

For example, if the number of nodes participating in the mmrestripefs is 6, 
then adjust "mmchconfig pitWorkerThreadsPerNode=5 -N 
". GPFS would need to be restarted for this parameter 
to take effect on the participating nodes (verify with mmfsadm dump 
config | grep pitWorkerThreadsPerNode)

Regards,
-Kums





From:   "Buterbaugh, Kevin L" 
To: gpfsug main discussion list 
Date:   05/04/2017 12:08 PM
Subject:Re: [gpfsug-discuss] Well, this is the pits...
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi Olaf, 

I didn’t touch pitWorkerThreadsPerNode … it was already zero.

I’m running 4.2.2.3 on my GPFS servers (some clients are on 4.2.1.1 or 
4.2.0.3 and are gradually being upgraded).  What version of GPFS fixes 
this?  With what I’m doing I need the ability to run mmrestripefs.

It seems to me that mmrestripefs could check whether QOS is enabled … 
granted, it would have no way of knowing whether the values used actually 
are reasonable or not … but if QOS is enabled then “trust” it to not 
overrun the system.

PMR time?  Thanks..

Kevin

On May 4, 2017, at 10:54 AM, Olaf Weiser  wrote:

Hi Kevin, 
the number of NSDs is more or less nonsense .. it is just that the number of 
nodes x PITWorker should not exceed the #mutex/FS block by too much
did you adjust/tune the PitWorker ? ... 

as far as I know.. the fact that the code checks the number of NSDs is already 
considered a defect and will be fixed / is already fixed (I stepped 
into it here as well) 

ps. QOS is the better approach to address this, but unfortunately.. not 
everyone is using it by default... that's why I suspect the development 
team decided to put in a check/limit here .. which in your case (with QOS) 
wouldn't be needed 





From:"Buterbaugh, Kevin L" 
To:gpfsug main discussion list 
Date:05/04/2017 05:44 PM
Subject:Re: [gpfsug-discuss] Well, this is the pits...
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi Olaf, 

Your explanation mostly makes sense, but...

Failed with 4 nodes … failed with 2 nodes … not gonna try with 1 node. And 
this filesystem only has 32 disks, which I would imagine is not an 
especially large number compared to what some people reading this e-mail 
have in their filesystems.

I thought that QOS (which I’m using) was what would keep an mmrestripefs 
from overrunning the system … QOS has worked extremely well for us - it’s 
one of my favorite additions to GPFS.

Kevin

On May 4, 2017, at 10:34 AM, Olaf Weiser  wrote:

no.. it is just in the code, because we have to avoid running out of mutexes 
/ block

reducing the number of nodes -N down to 4 (2 nodes is even safer) ... 
is the easiest way to solve it for now

I've been told the real root cause will be fixed in one of the next PTFs 
.. within this year .. 
this warning message itself should appear every time.. but unfortunately 
someone coded it so that it depends on the number of disks (NSDs).. that's why 
I suspect you didn't see it before
but the fact that we have to make sure not to overrun the system by 
mmrestripe remains.. so please lower the -N number of nodes to 4 or 
better 2 

(even though we know.. that the mmrestripe will take longer)


From:"Buterbaugh, Kevin L" 
To:gpfsug main discussion list 
Date:05/04/2017 05:26 PM
Subject:[gpfsug-discuss] Well, this is the pits...
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi All, 

Another one of those, “I can open a PMR if I need to” type questions…

We are in the process of combining two large GPFS filesystems into one new 
filesystem (for various reasons I won’t get into here).  Therefore, I’m 
doing a lot of mmrestripe’s, mmdeldisk’s, and mmadddisk’s.

Yesterday I did an “mmrestripefs  -r -N ” (after 
suspending a disk, of course).  Worked like it should.

Today I did a “mmrestripefs  -b -P capacity -N ” and got:

mmrestripefs: The total number of PIT worker threads of all participating 
nodes has been exceeded to safely restripe the file system.  The total 
number of PIT worker threads, which is the sum of 

Re: [gpfsug-discuss] RAID config for SSD's - potential pitfalls

2017-04-19 Thread Kumaran Rajaram
Hi,

>> As I've mentioned before, RAID choices for GPFS are not so simple. Here 
are  a couple points to consider, I'm sure there's more.  And if I'm 
wrong, someone will please correct me - but I believe the two biggest 
pitfalls are:

>>Some RAID configurations (classically 5 and 6) work best with large, 
full block writes.  When the file system does a partial block write, RAID 
may have to read a full "stripe" from several devices, compute the 
differences and then write back the modified data to several devices. 
>>This is certainly true with RAID that is configured over several storage 
devices, with error correcting codes.  SO, you do NOT want to put GPFS 
metadata (system pool!) on RAID configured with large stripes and error 
correction. This is the Read-Modify-Write Raid pitfall.

As you pointed out, the RAID choices for GPFS may not be simple, and we 
need to take into consideration factors such as the storage subsystem 
configuration/capabilities, for example whether all drives are homogeneous 
or there is a mix of drive types. If all the drives are homogeneous, then 
create dataAndMetadata NSDs across RAID-6; if the storage controller 
supports write-cache + write-cache mirroring (WC + WM), then enable it, 
since WC + WM can alleviate read-modify-write for small writes (typical of 
metadata). If there is a mix of SSD and HDD (e.g. 15K RPM), then we need 
to weigh the aggregate IOPS of RAID-1 SSD volumes vs. RAID-6 HDD volumes 
before separating data and metadata onto separate media. For example, if 
the storage subsystem has 2 x SSDs and ~300 x 15K RPM or NL-SAS HDDs, then 
most likely the aggregate IOPS of the RAID-6 HDD volumes will be higher 
than that of the RAID-1 SSD volumes. It is also recommended to assess the 
I/O performance of the different configurations (dataAndMetadata vs. 
dataOnly/metadataOnly NSDs) with representative application workloads and 
production scenarios before deploying the final solution. 
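
For such an assessment, a minimal sketch using the gpfsperf sample tool 
shipped with Spectrum Scale (the mount point, file name, sizes and thread 
counts below are assumptions for illustration; gpfsperf is typically built 
from /usr/lpp/mmfs/samples/perf):

# Sequential write and read against the candidate NSD/pool layout
/usr/lpp/mmfs/samples/perf/gpfsperf create seq /gpfs/testfs/perf/test.dat -n 16g -r 4m -th 8
/usr/lpp/mmfs/samples/perf/gpfsperf read   seq /gpfs/testfs/perf/test.dat -n 16g -r 4m -th 8

# Small-record random reads to approximate metadata-like access
/usr/lpp/mmfs/samples/perf/gpfsperf read   rand /gpfs/testfs/perf/test.dat -n 4g -r 4k -th 16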

>> GPFS has built-in replication features - consider using those instead 
of RAID replication (classically Raid-1).  GPFS replication can work with 
storage devices that are in different racks, separated by significant 
physical space, and from different manufacturers.  This can be more 
>>robust than RAID in a single box or single rack.  Consider a fire 
scenario, or exploding power supply or similar physical disaster. Consider 
that storage devices and controllers from the same manufacturer may have 
the same bugs, defects, failures.

For high resiliency (e.g. metadataOnly) where there are multiple storage 
subsystems across different failure domains (different racks/rooms/data 
centers etc.), it is good to enable BOTH hardware RAID-1 and GPFS metadata 
replication (at the minimum, -m 2). 

If there is a single shared storage subsystem for the GPFS file system and 
metadata is separated from data, then RAID-1 minimizes administrative 
overhead compared to GPFS replication in the event of a drive failure 
(since GPFS replication across single SSDs would require 
mmdeldisk/mmdelnsd/mmcrnsd/mmadddisk every time a disk goes faulty and 
needs to be replaced). 
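
As an illustration of the metadataOnly/dataOnly split discussed above, a 
minimal NSD stanza sketch (the device paths, NSD names, server names and 
failure group are placeholders, not taken from any real configuration; both 
NSDs are placed in the system pool here to keep the example free of 
placement-policy considerations):

%nsd:
  device=/dev/mapper/ssd_raid1_lun0
  nsd=md_nsd01
  servers=nsdserver01,nsdserver02
  usage=metadataOnly
  failureGroup=1
  pool=system

%nsd:
  device=/dev/mapper/hdd_raid6_lun0
  nsd=data_nsd01
  servers=nsdserver02,nsdserver01
  usage=dataOnly
  failureGroup=1
  pool=system

# Create the NSDs from the stanza file; they can then be used with
# mmcrfs (new file system) or mmadddisk (existing file system).
mmcrnsd -F nsd_stanza.txt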

Best,
-Kums






From:   Marc A Kaplan/Watson/IBM@IBMUS
To: gpfsug main discussion list 
Date:   04/19/2017 04:50 PM
Subject:Re: [gpfsug-discuss] RAID config for SSD's - potential 
pitfalls
Sent by:gpfsug-discuss-boun...@spectrumscale.org



As I've mentioned before, RAID choices for GPFS are not so simple. Here 
are a couple of points to consider; I'm sure there's more. And if I'm 
wrong, someone will please correct me - but I believe the two biggest 
pitfalls are:
Some RAID configurations (classically 5 and 6) work best with large, full 
block writes.  When the file system does a partial block write, RAID may 
have to read a full "stripe" from several devices, compute the differences 
and then write back the modified data to several devices.  This is 
certainly true with RAID that is configured over several storage devices, 
with error correcting codes.  SO, you do NOT want to put GPFS metadata 
(system pool!) on RAID configured with large stripes and error correction. 
This is the Read-Modify-Write Raid pitfall.
GPFS has built-in replication features - consider using those instead of 
RAID replication (classically Raid-1).  GPFS replication can work with 
storage devices that are in different racks, separated by significant 
physical space, and from different manufacturers.  This can be more robust 
than RAID in a single box or single rack.  Consider a fire scenario, or 
exploding power supply or similar physical disaster.  Consider that 
storage devices and controllers from the same manufacturer may have the 
same bugs, defects, failures. 

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




Re: [gpfsug-discuss] question on viewing block distribution across NSDs

2017-03-30 Thread Kumaran Rajaram
Hi,

Yes, you could use "mmdf" to obtain file-system "usage" across the NSDs 
(comprising the file-system).

If you want to obtain "data block distribution corresponding to a file 
across the NSDs", then there is a utility "mmgetlocation" in 
/usr/lpp/mmfs/samples/fpo that can be used to get file-data-blocks to NSD 
mapping. 

Example: 

# File-system comprises a single storage pool; all NSDs configured as 
dataAndMetadata, -m 1 -r 1, FS block-size=2MiB
# mmlsfs gpfs1b | grep 'Block size'
 -B 2097152  Block size

# The file-system is comprised of 10 x dataAndMetadata NSDs
# mmlsdisk gpfs1b | grep DMD | wc -l
10

# Create a sample file that is 40MiB (20 data blocks)
/mnt/sw/benchmarks/gpfsperf/gpfsperf create seq -r 2m -n 40m 
/mnt/gpfs1b/temp_dir/lf.s.1

# File size is 40 MiB
(09:52:49) c25m3n07:~ # ls -lh /mnt/gpfs1b/temp_dir/lf.s.1
-rw-r--r-- 1 root root 40M Mar 17 09:52 /mnt/gpfs1b/temp_dir/lf.s.1
(09:52:54) c25m3n07:~ # du -sh /mnt/gpfs1b/temp_dir/lf.s.1
40M /mnt/gpfs1b/temp_dir/lf.s.1

# Verified through mmgetlocation that the file data blocks are uniformly 
striped across all the dataAndMetadata NSDs, with each NSD containing 2 
file data blocks
# In the output below, "DMD_NSDX" is the name of each NSD. 
(09:53:00) c25m3n07:~ # /usr/lpp/mmfs/samples/fpo/mmgetlocation -f 
/mnt/gpfs1b/temp_dir/lf.s.1

[FILE INFO]


blockSize             2 MB
blockGroupFactor      1
metadataBlockSize     2M
writeAffinityDepth    0
flags:
data replication:     1 max 2
storage pool name:    system
metadata replication: 1 max 2

Chunk 0 (offset 0) is located at disks:  [ DMD_NSD09 c25m3n07-ib,c25m3n08-ib ]
Chunk 1 (offset 2097152) is located at disks:  [ DMD_NSD10 c25m3n08-ib,c25m3n07-ib ]
Chunk 2 (offset 4194304) is located at disks:  [ DMD_NSD01 c25m3n07-ib,c25m3n08-ib ]
Chunk 3 (offset 6291456) is located at disks:  [ DMD_NSD02 c25m3n08-ib,c25m3n07-ib ]
Chunk 4 (offset 8388608) is located at disks:  [ DMD_NSD03 c25m3n07-ib,c25m3n08-ib ]
Chunk 5 (offset 10485760) is located at disks:  [ DMD_NSD04 c25m3n08-ib,c25m3n07-ib ]
Chunk 6 (offset 12582912) is located at disks:  [ DMD_NSD05 c25m3n07-ib,c25m3n08-ib ]
Chunk 7 (offset 14680064) is located at disks:  [ DMD_NSD06 c25m3n08-ib,c25m3n07-ib ]
Chunk 8 (offset 16777216) is located at disks:  [ DMD_NSD07 c25m3n07-ib,c25m3n08-ib ]
Chunk 9 (offset 18874368) is located at disks:  [ DMD_NSD08 c25m3n08-ib,c25m3n07-ib ]
Chunk 10 (offset 20971520) is located at disks:  [ DMD_NSD09 c25m3n07-ib,c25m3n08-ib ]
Chunk 11 (offset 23068672) is located at disks:  [ DMD_NSD10 c25m3n08-ib,c25m3n07-ib ]
Chunk 12 (offset 25165824) is located at disks:  [ DMD_NSD01 c25m3n07-ib,c25m3n08-ib ]
Chunk 13 (offset 27262976) is located at disks:  [ DMD_NSD02 c25m3n08-ib,c25m3n07-ib ]
Chunk 14 (offset 29360128) is located at disks:  [ DMD_NSD03 c25m3n07-ib,c25m3n08-ib ]
Chunk 15 (offset 31457280) is located at disks:  [ DMD_NSD04 c25m3n08-ib,c25m3n07-ib ]
Chunk 16 (offset 33554432) is located at disks:  [ DMD_NSD05 c25m3n07-ib,c25m3n08-ib ]
Chunk 17 (offset 35651584) is located at disks:  [ DMD_NSD06 c25m3n08-ib,c25m3n07-ib ]
Chunk 18 (offset 37748736) is located at disks:  [ DMD_NSD07 c25m3n07-ib,c25m3n08-ib ]
Chunk 19 (offset 39845888) is located at disks:  [ DMD_NSD08 c25m3n08-ib,c25m3n07-ib ]

[SUMMARY INFO]
--
Replica num  Nodename   TotalChunks

Replica 1 : c25m3n07-ib,c25m3n08-ib:  Total : 10
Replica 1 : c25m3n08-ib,c25m3n07-ib:  Total : 10

Best Regards,
-Kums






From:   
To: 
Date:   03/29/2017 08:00 PM
Subject:Re: [gpfsug-discuss] question on viewing block 
distribution across NSDs
Sent by:gpfsug-discuss-boun...@spectrumscale.org



I was going to keep mmdf in mind, not gpfs.snap. I will now also keep in 
mind that mmdf can have an impact as at present we have spinning disk for 
metadata. The system I am playing around on is not production yet, so I am 
safe for the moment.
 
Thanks again.
 
From: gpfsug-discuss-boun...@spectrumscale.org [
mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Knister, 
Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]
Sent: Thursday, 30 March 2017 9:55 AM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] question on viewing block distribution 
across NSDs
 
I don't necessarily think you need to run a snap prior, just the output of 
mmdf should be enough. Something to keep in mind that I should have said 
before-- an mmdf can be stressful on your system particularly if you have 
spinning disk for your metadata. We're fortunate enough to have all flash 
for our metadata and I tend to take it for granted some times :)
 
From: greg.lehm...@csiro.au
Sent: 3/29/17, 19:52
To: gpfsug main discussion list
Subject: