Re: [gpfsug-discuss] Change uidNumber and gidNumber for billions of files

2020-06-10 Thread Lohit Valleru
Thank you, everyone, for the inputs.

The answers to some of the questions are as follows:

> From Jez: I've done this a few times in the past in a previous life.  In many 
> respects it is easier (and faster!) to remap the AD side to the uids already 
> on the filesystem.
- Yes, we had considered/attempted this, and it works pretty well. It is 
actually much faster than using SSSD automatic ID mapping.
But the main issue with this approach was automating the entry of uidNumbers and 
gidNumbers for all the enterprise users/groups across the agency. Both 
approaches have their pros and cons.

For now, we wanted to see the amount of effort that would be needed to change 
the uidNumbers and gidNumbers on the filesystem side, in case the other option 
of entering existing uidNumber/gidNumber data on AD does not work out.

> Does the filesystem have ACLs? And which ACLs?
Since we have CES servers that export the filesystems over the SMB protocol, the 
filesystems use NFS4 ACL mode.
As far as we know, only one fileset is using NFS4 ACLs extensively.

> Can we take a downtime to do this change?
For the current GPFS storage clusters that are in production, we are 
thinking of taking a downtime to do this per cluster. For new 
clusters/storage clusters, we are thinking of changing to AD before any new 
data is written to the storage. 

> Do the uidNumbers/gidNumbers conflict?
No. The current uidNumbers and gidNumbers are in the 1000 - 8000 range, while the new 
uidNumbers/gidNumbers are above 100. 

I was thinking of taking a backup of the current state of the filesystem with 
respect to POSIX permissions/owner/group and the respective quotas, and then 
disabling quotas during a downtime before making changes.
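
Something along these lines is what I have in mind for the snapshot - a rough 
sketch only, where the filesystem name, mount point, and output location are 
placeholders, and where for billions of files the walk would have to be 
parallelized or driven by mmapplypolicy rather than a single serial find:

FS=gpfs01                                  # placeholder filesystem name
MOUNT=/gpfs/${FS}                          # placeholder mount point
SNAPDIR=/root/uidmig-backup/$(date +%Y%m%d)
mkdir -p "$SNAPDIR"

# Record numeric owner, numeric group, octal mode and path for every inode
# (GNU find; %U/%G/%m/%p are numeric uid, numeric gid, mode, path).
find "$MOUNT" -printf '%U %G %m %p\n' > "$SNAPDIR/owner_group_mode.txt"

# Record the current user and group quotas before disabling them.
mmrepquota -u -g "$FS" > "$SNAPDIR/quotas_before.txt"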

I might mostly start small with a single lab, and only change files without 
ACLs.
May I know if anyone has a method/tool to find out which files/dirs have NFS4 
ACLs set? As far as we know - it is just one fileset/lab, but it would be good 
to confirm if we have them set across any other files/dirs in the filesystem. 
The usual methods do not seem to work.  
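
The closest I have so far is a brute-force check along the lines below - a 
minimal sketch that assumes a pre-generated path list (for example from an 
mmapplypolicy LIST rule or the samples/util tsreaddir script), and whose grep 
pattern is only my assumption about what NFSv4 entries look like in 
mmgetacl -k native output. It is far too slow for billions of files, which is 
why I am asking:

PATHLIST=${1:-/root/uidmig-backup/all_paths.txt}   # placeholder path list

while IFS= read -r f; do
    # NFSv4-style ACLs are assumed to show entries such as "special:owner@"
    # or "...:allow/deny", unlike plain POSIX mode/ACL output.
    if mmgetacl -k native "$f" 2>/dev/null | grep -qE 'special:|:(allow|deny)'; then
        echo "NFS4ACL $f"
    fi
done < "$PATHLIST"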

Jonathan/Aaron,
Thank you for the inputs regarding the scripts/APIs/symlinks and ACLs. I will 
try to see what I can do given the current state.
I too wish the GPFS API could be better at managing these kinds of scenarios, but I 
understand that changes this large are probably pretty rare.

Thank you,
Lohit

On June 10, 2020 at 6:33:45 AM, Jonathan Buzzard 
(jonathan.buzz...@strath.ac.uk) wrote:

On 10/06/2020 02:15, Aaron Knister wrote:  
> Lohit,  
>  
> I did this while working @ NASA. I had two tools I used, one  
> affectionately known as "luke file walker" (to modify traditional unix  
> permissions) and the other known as the "milleniumfacl" (to modify posix  
> ACLs). Stupid jokes aside, there were some real technical challenges here.  
>  
> I don't know if anyone from the NCCS team at NASA is on the list, but if  
> they are perhaps they'll jump in if they're willing to share the code :)  
>  
> From what I recall, I used uthash and the GPFS APIs to store in-memory  
> a hash of inodes and their uid/gid information. I then walked the  
> filesystem using the GPFS APIs and could look up the given inode in the  
> in-memory hash to view its ownership details. Both the inode traversal  
> and directory walk were parallelized/threaded. The way I actually  
> executed the chown was particularly security-minded. There is a race  
> condition that exists if you chown /path/to/file. All it takes is either  
> a malicious user or someone monkeying around with the filesystem while  
> it's live to accidentally chown the wrong file if a symbolic link ends  
> up in the file path.  

Well I would expect this needs to be done with no user access to the  
system. Or at the very least no user access for the bits you are  
currently modifying. Otherwise you are going to end up in a complete mess.  

> My work around was to use openat() and fchmod (I  
> think that was it, I played with this quite a bit to get it right) and  
> for every path to be chown'd I would walk the hierarchy, opening each  
> component with the O_NOFOLLOW flags to be sure I didn't accidentally  
> stumble across a symlink in the way.  

Or you could just use lchown so you change the ownership of the symbolic  
link rather than the file it is pointing to. You need to change the  
ownership of the symbolic link not the file it is linking to, that will  
be picked up elsewhere in the scan. If you don't change the ownership of  
the symbolic link you are going to be left with a bunch of links owned  
by nonexistent users. No race condition exists if you are doing it  
properly in the first place :-)  
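
As a rough illustration of that approach from the shell - not the tools Aaron 
describes, just a sketch that assumes GNU find/chown, a made-up map file of 
"olduid newuid" (and "oldgid newgid") pairs, and no concurrent user access:

TARGET=/gpfs/gpfs01/lab1        # hypothetical fileset path

# -h is the lchown equivalent: a symbolic link gets its own ownership
# changed, never the file it points to. --from limits the change to
# inodes still owned by the old id, so reruns are safe.
while read -r old new; do
    find "$TARGET" -uid "$old" -print0 |
        xargs -0 -r -P 8 -n 1000 chown -h --from="$old" "$new"
done < uid.map

while read -r old new; do
    find "$TARGET" -gid "$old" -print0 |
        xargs -0 -r -P 8 -n 1000 chown -h --from=":$old" ":$new"
done < gid.map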

I concluded that the standard nftw system call was more suited to this  
than the GPFS inode scan. I could see no way to turn an inode into a  
path to the file, which lchown, gpfs_getacl and gpfs_putacl all use.  

I think the problem with the GPFS inode scan is that it is for a 

[gpfsug-discuss] Change uidNumber and gidNumber for billions of files

2020-06-08 Thread Lohit Valleru
Hello Everyone,

We are planning to migrate from LDAP to AD, and one of the best solutions was to 
change the uidNumber and gidNumber to what SSSD or Centrify would resolve.

May I know, if anyone has come across a tool/tools that can change the 
uidNumbers and gidNumbers of billions of files efficiently and in a reliable 
manner?
We could spend some time to write a custom script, but wanted to know if a tool 
already exists.

Please do let me know if anyone else has come across a similar situation, and 
the steps/tools used to resolve it.

Regards,
Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Network switches/architecture for GPFS

2020-03-20 Thread Valleru, Lohit/Information Systems
Hello All,

I would like to discuss and understand which Ethernet networking 
switches/architectures seem to work best with GPFS. 
We had thought about InfiniBand, but are not yet ready to move to it 
because of the complexity/upgrade and debugging issues that come with it. 

Current hardware:

We are currently using an Arista 7328x 100G core switch for networking among the 
GPFS clusters and the compute nodes.

It is a heterogeneous network, with some of the servers on 10G/25G/100G, with and 
without LACP.

For example: 

GPFS storage clusters either have 25G LACP, or 10G LACP, or a single 100G 
network port.
Compute nodes range from 10G to 100G.
Login nodes/transfer servers etc. have 25G bonded.

Most of the servers have Mellanox ConnectX-4 or ConnectX-5 adapters, but we 
also have a few older Intel, Broadcom, and Chelsio network cards in the clusters.

Most of the transceivers that we use are Mellanox, Finisar, and Intel.

Issue:

We had upgraded to the above switch recently, and we saw that it was not 
able to handle the network traffic because of the higher NSD server bandwidth 
versus the lower compute node bandwidth.

One issue that we did see was a lot of network discards on the switch side and 
network congestion with slow IO performance on the respective compute nodes.

Once we enabled ECN, we did see that it reduced the network congestion.
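
For reference, the host-side part of that is just a sysctl - a minimal sketch 
only, since the ECN/WRED thresholds that actually matter are configured on the 
switch itself and are vendor-specific:

# Ask for ECN on outgoing TCP connections (0 = off, 1 = request, 2 = accept only).
sysctl -w net.ipv4.tcp_ecn=1

# Make it persistent across reboots (file name is just a convention).
echo 'net.ipv4.tcp_ecn = 1' > /etc/sysctl.d/90-ecn.conf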

We do see expels once in a while, but that is mostly related to network 
errors or a host not responding. We observed that bonding/LACP does make 
expels much trickier, so we have decided to go with no LACP until the GPFS 
code gets better at handling LACP - which I think they are working on.

We have heard that our current switch is a shallow-buffer switch, and that we would 
need a deeper-buffer Arista switch to perform better, with less 
congestion, lower latency, and more throughput.

On the other hand, Mellanox promises a better ASIC design and buffer 
architecture with a spine-leaf design, instead of one deep-buffer core switch, to 
get better performance than Arista.

Most of the applications that run on the clusters are either genomic 
applications on CPUs or deep learning applications on GPUs. 

All of our GPFS storage cluster versions are above 5.0.2, with the compute 
filesystems at 16M block size on nearline rotating disks, and Flash storage at 
512K block size.


May I get feedback from anyone who is using Arista or Mellanox 
switches with their clusters, to understand the pros and cons, stability, and 
performance numbers of each?


Thank you,
Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Maxblocksize tuning alternatives/max number of buffers

2020-02-28 Thread Valleru, Lohit/Information Systems
Hello Anderson,

This application requires a minimum throughput of about 10-13 MB/s and almost 
no IOPS during the first phase, where it opens all the files and reads the 
headers, and about 30 MB/s throughput during the second phase.
The issue that I face is during the second phase, where it tries to randomly 
read about 4K at a time from random files, from 2 to about 10.
In this phase I see the maxblocksize parameter making a big difference to the 
performance of the reads, with almost no throughput and maybe around 2-4K IOPS.

This issue is a follow-up to the previous issue that I had mentioned about a 
year ago, where I see differences in performance “though there is 
practically no IO to the storage.”
I mean that I see a difference in performance between different FS block sizes 
even if all the data is cached in the pagepool.
Sven had replied to that thread mentioning that it could be because of a buffer 
locking issue.
 
The info requested is as below: 

4 Storage clusters:

Storage cluster for compute:
5.0.3-2 GPFS version
FS version: 19.01 (5.0.1.0)
Subblock size: 16384
Blocksize : 16M 

Flash Storage Cluster for compute:
5.0.4-2 GPFS version
FS version: 18.00 (5.0.0.0)
Subblock size: 8192
Blocksize: 512K

Storage cluster for admin tools:
5.0.4-2 GPFS version
FS version: 16.00 (4.2.2.0)
Subblock size: 131072
Blocksize: 4M

Storage cluster for archival:
5.0.3-2 GPFS version
FS version: 16.00 (4.2.2.0)
Subblock size: 32K
Blocksize: 1M 

The only two clusters that users do/will do compute on are the 16M filesystem 
and the 512K filesystem.

When you ask what the throughput/IOPS and block size are - it varies a lot and 
has not been recorded.
The 16M FS is capable of doing about 27GB/s seq read for about 1.8 PB of 
storage.
The 512K FS is capable of doing about 10-12GB/s seq read for about 100T of 
storage.

Now, as I mentioned previously, the issue that I am seeing is related to 
different FS block sizes on the same storage.
For example: 
On the Flash Storage cluster: 
A block size of 512K with a maxblocksize of 16M gives worse performance than a block 
size of 512K with a maxblocksize of 512K.
It is the maxblocksize that is affecting the performance, on the same storage 
with the same block size and everything else being the same.
I am thinking the above is because of the number of buffers involved, but I would 
like to learn if it happens to be anything else.
I have debugged this with IBM GPFS techs, and it has been found that there is no 
issue with the storage itself or with any of the other GPFS tuning parameters.

Now, since we do know that maxblocksize makes a big difference,
I would like to keep it as low as possible but still be able to mount other 
remote GPFS filesystems with larger block sizes.
Or, since it is required to keep the maxblocksize the same across all storage, 
I would like to know if there is any other parameter that could do the same 
job as maxblocksize.
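
For context, this is roughly how I look at it today - the commands are standard, 
but the buffer-count arithmetic is only my working assumption about what 
maxblocksize limits, not an official formula:

# In-effect values on a node; mmdiag --config also shows parameters such as
# maxBufferDescs that are rarely set explicitly.
mmdiag --config | grep -iE 'pagepool|maxblocksize|maxbufferdescs'

# Order-of-magnitude estimate of full-block buffers that fit in the pagepool:
PAGEPOOL_MIB=16384    # e.g. a 16 GiB pagepool
echo "16M  maxblocksize: ~$(( PAGEPOOL_MIB / 16 )) full-block buffers"
echo "512K maxblocksize: ~$(( PAGEPOOL_MIB * 2 )) full-block buffers"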


Thank you,
Lohit   



> On Feb 28, 2020, at 12:58 PM, Anderson Ferreira Nobre  
> wrote:
> 
> Hi Lohit,
>  
> First, a few questions to understand better your problem:
> - What is the minimum release level of both clusters?
> - What is the version of filesystem layout for 16MB, 1MB and 512KB?
> - What is the subblocksize of each filesystem?
> - How many IOPS, block size and throughput are you doing on each filesystem?
>  
> Abraços / Regards / Saludos,
>  
> Anderson Nobre
> Power and Storage Consultant
> IBM Systems Hardware Client Technical Team – IBM Systems Lab Services
> 
> 
>  
> Phone: 55-19-2132-4317
> E-mail: ano...@br.ibm.com <mailto:ano...@br.ibm.com>  
>  
>  
> - Original message -
> From: "Valleru, Lohit/Information Systems" 
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
> To: gpfsug-discuss@spectrumscale.org
> Cc:
> Subject: [EXTERNAL] [gpfsug-discuss] Maxblocksize tuning alternatives/max 
> number of buffers
> Date: Fri, Feb 28, 2020 12:30
>  
> Hello Everyone,
> 
> I am looking for alternative tuning parameters that could do the same job as 
> tuning the maxblocksize parameter.
> 
> One of our users runs a deep learning application on GPUs that does the 
> following IO pattern:
> 
> It needs to read random small sections, about 4K in size, from about 20,000 to 
> 100,000 files, each 100M to 200M in size.
> 
> When performance tuning for the above application on a 16M filesystem and 
> comparing it to various other file system block sizes - I realized that the 
> performance degradation that I see might be related to the number of buffers.
> 
> I observed that the performance varies widely depending on what maxblocksize 
> parameter I use.
> For example, using a 16M maxblocksize for a 512K or a 1M block size 
> filesystem differs widely from using a 512K or 1M maxblocksize for a  512K or 
> a 1M block size filesystem

[gpfsug-discuss] Maxblocksize tuning alternatives/max number of buffers

2020-02-28 Thread Valleru, Lohit/Information Systems
Hello Everyone,

I am looking for alternative tuning parameters that could do the same job as 
tuning the maxblocksize parameter.

One of our users runs a deep learning application on GPUs that does the 
following IO pattern:

It needs to read random small sections, about 4K in size, from about 20,000 to 
100,000 files, each 100M to 200M in size.

When performance tuning for the above application on a 16M filesystem and 
comparing it to various other file system block sizes - I realized that the 
performance degradation that I see might be related to the number of buffers.

I observed that the performance varies widely depending on what maxblocksize 
parameter I use.
For example, using a 16M maxblocksize for a 512K or a 1M block size filesystem 
differs widely from using a 512K or 1M maxblocksize for a 512K or a 1M block 
size filesystem.

The reason, I believe, might be related to the number of buffers that I can 
keep on the client side, but I am not sure if that is all that 
maxblocksize affects.

We have different file system block sizes in our environment: 512K, 
1M and 16M.

We also use a storage cluster and compute cluster design.

Now, in order to mount the 16M filesystem along with the other filesystems on the 
compute clusters, we had to keep the maxblocksize at 16M - no matter what 
the file system block size is.

I see that I get maximum performance for this application from a 512K block 
size filesystem and a 512K maxblocksize.
However, I will not be able to mount this filesystem along with the other 
filesystems because I will need to change the maxblocksize to 16M in order to 
mount the other filesystems of 16M block size.

I am wondering if there is anything else that can do the same job as the 
maxblocksize parameter.

I was thinking about parameters like maxBufferDescs for a 16M maxblocksize, 
but I believe it would need a lot more pagepool to keep the same number of 
buffers as would be needed for a 512K maxblocksize.

May I know if there is any other parameter that could help in the same way as 
maxblocksize, and the side effects of the same?

Thank you,
Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 81, Issue 43

2019-11-20 Thread valleru
body will point this out
> >>> soon ;-)
> >>>
> >>> sven
> >>>
> >>>
> >>>
> >>>
> >>> On Tue, Sep 18, 2018 at 10:31 AM  wrote:
> >>>
> >>>> Hello All,
> >>>>
> >>>> This is a continuation to the previous discussion that i had with Sven.
> >>>> However against what i had mentioned previously - i realize that this
> >>>> is “not” related to mmap, and i see it when doing random freads.
> >>>>
> >>>> I see that block-size of the filesystem matters when reading from Page
> >>>> pool.
> >>>> I see a major difference in performance when compared 1M to 16M, when
> >>>> doing lot of random small freads with all of the data in pagepool.
> >>>>
> >>>> Performance for 1M is a magnitude “more” than the performance that i
> >>>> see for 16M.
> >>>>
> >>>> The GPFS that we have currently is :
> >>>> Version : 5.0.1-0.5
> >>>> Filesystem version: 19.01 (5.0.1.0)
> >>>> Block-size : 16M
> >>>>
> >>>> I had made the filesystem block-size to be 16M, thinking that i would
> >>>> get the most performance for both random/sequential reads from 16M than 
> >>>> the
> >>>> smaller block-sizes.
> >>>> With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus
> >>>> not loose lot of storage space even with 16M.
> >>>> I had run few benchmarks and i did see that 16M was performing better
> >>>> “when hitting storage/disks” with respect to bandwidth for
> >>>> random/sequential on small/large reads.
> >>>>
> >>>> However, with this particular workload - where it freads a chunk of
> >>>> data randomly from hundreds of files -> I see that the number of
> >>>> page-faults increase with block-size and actually reduce the performance.
> >>>> 1M performs a lot better than 16M, and may be i will get better
> >>>> performance with less than 1M.
> >>>> It gives the best performance when reading from local disk, with 4K
> >>>> block size filesystem.
> >>>>
> >>>> What i mean by performance when it comes to this workload - is not the
> >>>> bandwidth but the amount of time that it takes to do each iteration/read
> >>>> batch of data.
> >>>>
> >>>> I figure what is happening is:
> >>>> fread is trying to read a full block size of 16M - which is good in a
> >>>> way, when it hits the hard disk.
> >>>> But the application could be using just a small part of that 16M. Thus
> >>>> when randomly reading(freads) lot of data of 16M chunk size - it is page
> >>>> faulting a lot more and causing the performance to drop .
> >>>> I could try to make the application do read instead of freads, but i
> >>>> fear that could be bad too since it might be hitting the disk with a very
> >>>> small block size and that is not good.
> >>>>
> >>>> With the way i see things now -
> >>>> I believe it could be best if the application does random reads of
> >>>> 4k/1M from pagepool but some how does 16M from rotating disks.
> >>>>
> >>>> I don’t see any way of doing the above other than following a different
> >>>> approach where i create a filesystem with a smaller block size ( 1M or 
> >>>> less
> >>>> than 1M ), on SSDs as a tier.
> >>>>
> >>>> May i please ask for advise, if what i am understanding/seeing is right
> >>>> and the best solution possible for the above scenario.
> >>>>
> >>>> Regards,
> >>>> Lohit
> >>>>
> >>>> On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru ,
> >>>> wrote:
> >>>>
> >>>> Hey Sven,
> >>>>
> >>>> This is regarding mmap issues and GPFS.
> >>>> We had discussed previously of experimenting with GPFS 5.
> >>>>
> >>>> I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2
> >>>>
> >>>> I am yet to experiment with mmap performance, but before that - I am
> >>>> seeing weird hangs with GPFS 5 and I think it could be related to mmap.
> >>>>

Re: [gpfsug-discuss] Follow-up: migrating billions of files

2019-03-08 Thread valleru
Thank you Marc. I was just trying to suggest another approach to this email 
thread.

However, I believe we cannot run mmfind/mmapplypolicy against remote filesystems 
- they can only be run on the owning cluster? In our clusters, all the GPFS 
clients are generally in their own compute clusters and mount filesystems from 
other storage clusters - which I thought is one of the recommended designs.
The scripts in the /usr/lpp/mmfs/samples/util folder do work with remote 
filesystems, and thus on the compute nodes.

I was also trying to find something that could be used by users and not by the 
superuser… but I guess none of these tools are meant to be run by a user 
without superuser privileges.

Regards,
Lohit

On Mar 8, 2019, 3:54 PM -0600, Marc A Kaplan , wrote:
> Lohit... Any and all of those commands and techniques should still work with 
> newer version of GPFS.
>
> But mmapplypolicy is the supported command for generating file lists.  It 
> uses the GPFS APIs and some parallel processing tricks.
>
> mmfind is a script that makes it easier to write GPFS "policy rules" and runs 
> mmapplypolicy for you.
>
> mmxcp can be used with mmfind (and/or mmapplypolicy) to make it easy to run a 
> cp (or other command) in parallel on those filelists ...
>
> --marc K of GPFS
>
>
>
> From:        vall...@cbio.mskcc.org
> To:        gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> Date:        03/08/2019 10:13 AM
> Subject:        Re: [gpfsug-discuss] Follow-up: migrating billions of files
> Sent by:        gpfsug-discuss-boun...@spectrumscale.org
>
>
>
> I had to do this twice too. Once I had to copy a 4 PB filesystem as fast as 
> possible when NSD disk descriptors were corrupted and shutting down GPFS 
> would have led to me losing those files forever, and the other was a regular 
> maintenance but I had to copy similar data in less time.
>
> In both cases, I just used the GPFS-provided util scripts in 
> /usr/lpp/mmfs/samples/util/. These can be run only as root, I believe. I 
> wish I could give them to users to use.
>
> I had used a few of those scripts, like tsreaddir, which used to be really fast 
> in listing all the paths in the directories. It prints the full paths of all 
> files along with their inodes etc. I had modified it to print just the full 
> file paths.
>
> I then use these paths and group them up into different groups which get fed 
> into array jobs on the SGE/LSF cluster.
> Each array job basically uses GNU parallel, running something similar to 
> rsync -avR. The “-R” option basically creates the directories as given.
> Of course this worked because i was using the fast private network to 
> transfer between the storage systems. Also i know that cp or tar might be 
> better than rsync with respect to speed, but rsync was convenient and i could 
> always start over again without checkpointing or remembering where i left off 
> previously.
>
> Similar to how Bill mentioned in the previous email, but i used gpfs util 
> scripts and basic GNU parallel/rsync, SGE/LSF to submit jobs to the cluster 
> as superuser. It used to work pretty well.
>
> Since then - I constantly use parallel and rsync to copy large directories.
>
> Thank you,
> Lohit
>
> On Mar 8, 2019, 7:43 AM -0600, William Abbott , wrote:
> We had a similar situation and ended up using parsyncfp, which generates
> multiple parallel rsyncs based on file lists. If they're on the same IB
> fabric (as ours were) you can use that instead of ethernet, and it
> worked pretty well. One caveat is that you need to follow the parallel
> transfers with a final single rsync, so you can use --delete.
>
> For the initial transfer you can also use bbcp. It can get very good
> performance but isn't nearly as convenient as rsync for subsequent
> transfers. The performance isn't good with small files but you can use
> tar on both ends to deal with that, in a similar way to what Uwe
> suggests below. The bbcp documentation outlines how to do that.
>
> Bill
>
> On 3/6/19 8:13 AM, Uwe Falke wrote:
> Hi, in that case I'd open several tar pipes in parallel, maybe using
> directories carefully selected, like
>
> tar -c <source_dirs> | ssh <target_host> "tar -x"
>
> I am not quite sure whether "-C /" for tar works here ("tar -C / -x"), but
> along these lines might be a good efficient method. target_hosts should be
> all nodes having the target file system mounted, and you should start
> those pipes on the nodes with the source file system.
> It is best to start with the largest directories, and use some
> masterscript to start the tar pipes controlled by semaphores to not
> overload anything.
>
>
>
> Mit freundlichen Grüßen / Kind regards
>
>
> Dr. Uwe Falke
>
> IT Specialist
> High Performance Computing Services / Integrated Technology Services /
> Data Center Services
> ---
> IBM Deutschland
> Rathausstr. 

Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share

2019-03-08 Thread valleru
Well, reading the user-defined authentication documentation again - it is 
basically left to sysadmins to deal with authentication, and it looks like it 
would not be so much of a hack to customize SMB on the CES nodes according to our 
needs.
I will see if I can do this without much trouble.

Regards,
Lohit

On Mar 8, 2019, 10:42 AM -0600, vall...@cbio.mskcc.org, wrote:
> Thank you Simon.
>
> I do remember reading your page about few years back, when i was researching 
> this issue.
> When you mentioned Custom Auth. I assumed it to be user-defined 
> authentication from CES. However, looks like i need to hack it a bit to get 
> SMB working with AD?
>
> I did not feel comfortable hacking the SMB from the CES cluster, and thus i 
> was trying to bring up SMB outside the CES cluster. I almost hack with 
> everything in the cluster but i leave GPFS and any of its configuration in 
> the supported config, because if things break - i felt it might mess up 
> things real bad.
> I wish we do not have to hack our way out of this, and IBM supported this 
> config out of the box.
>
> I do not understand the current requirements from CES with respect to AD or 
> user defined authentication where either both SMB and NFS should be AD/LDAP 
> authenticated or both of them user defined.
>
> I believe many places do use just ssh-key as authentication for linux 
> machines including the cloud instances, while SMB obviously cannot be used 
> with ssh-key authentication and has to be used either with LDAP or AD 
> authentication.
>
> Did anyone try to raise this as a feature request?
>
> Even if I do figure out how to hack this thing and make sure that updating CES won’t 
> mess it up badly, I think I will have to do a few things to get the SIDs to 
> match UIDs, as you mentioned.
> We do not use passwords to authenticate to LDAP and I do not want to be 
> creating another set of passwords apart from AD which is already existing, 
> and users authenticate to it when they login to machines.
>
> I was thinking to bring up something like Redhat IDM that could sync with AD 
> and get all the usernames/sids and password hashes. I could then enter my 
> current LDAP uids/gids in the Redhat IDM. IDM will automatically create 
> uids/gids for usernames that do not have them i believe.
> In this way, when SMB authenticates with Redhat IDM - users can use their 
> current AD Kerberos tickets or the same passwords, and I do not have to change 
> the passwords.
> It will also automatically sync with AD and create UIDs/GIDs and thus i don’t 
> have to manually script something to create one for every person in AD.
> I however need to see if i could get to make this work with institutional AD 
> and it might not be as smooth.
>
> So which of the below cases will IBM most probably support? :)
>
> 1. Run SMB outside the CES cluster with the above configuration.
> 2. Hack SMB inside the CES cluster
>
> Is it that running SMB outside the CES cluster with R/W has a possibility of 
> corrupting the GPFS filesystem?
> We do not necessarily need HA with SMB and so apart from HA - What does IBM 
> SMB do that would prevent such corruption from happening?
>
> The reason I was expecting the usernames to be the same in LDAP and AD is 
> that, if they are, then SMB will do UID mapping by default, i.e. SMB will 
> automatically map Windows SIDs to LDAP UIDs. I would not have to bring up 
> Redhat IDM if that were the case. But unfortunately we have many users who 
> have different LDAP usernames from their AD usernames - so I guess the practical 
> way would be to use Redhat IDM to map Windows SIDs to LDAP UIDs.
>
> I have read about mmname2uid and mmuid2name that Andrew mentioned but looks 
> like it is made to work between 2 gpfs clusters with different uids. Not 
> exactly to make SMB map windows SIDs to ldap uids.
>
> Regards,
> Lohit
>
> On Mar 8, 2019, 2:41 AM -0600, Simon Thompson , 
> wrote:
> > Hi Lohit,
> >
> > Custom auth sounds like it would work.
> >
> > NFS uses the “system” ldap, SMB can use LDAP or AD, or you can fudge it and 
> > actually use both. We came at this very early in CES and I think some of 
> > this is better in mixed mode now, but we do something vaguely related to 
> > what you need.
> >
> > What you’d need is data in your ldap server to map windows usernames and 
> > SIDs to Unix IDs. So for example we have in our mmsmb config:
> > idmap config * : backend   ldap
> > idmap config * : bind_path_group   
> > ou=SidMap,dc=rds,dc=adf,dc=bham,dc=ac,dc=uk
> > idmap config * : ldap_base_dn  
> > ou=SidMap,dc=rds,dc=adf,dc=bham,dc=ac,dc=uk
> > idmap config * : ldap_server   stand-alone
> > idmap config * : ldap_url  ldap://localhost
> > idmap config * : ldap_user_dn  
> > uid=nslcd,ou=People,dc=rds,dc=adf,dc=bham,dc=ac,dc=uk
> > idmap config * : range 1000-999
> > idmap config * : rangesize 100
> > idmap config * : read only yes
> >
> > You then need entries in the LDAP server, it could 

Re: [gpfsug-discuss] Follow-up: migrating billions of files

2019-03-08 Thread valleru
I had to do this twice too. Once I had to copy a 4 PB filesystem as fast as 
possible when NSD disk descriptors were corrupted and shutting down GPFS would 
have led to me losing those files forever, and the other was a regular 
maintenance but I had to copy similar data in less time.

In both cases, I just used the GPFS-provided util scripts in 
/usr/lpp/mmfs/samples/util/. These can be run only as root, I believe. I 
wish I could give them to users to use.

I had used a few of those scripts, like tsreaddir, which used to be really fast in 
listing all the paths in the directories. It prints the full paths of all files 
along with their inodes etc. I had modified it to print just the full file 
paths.

I then use these paths and group them up into different groups which get fed 
into array jobs on the SGE/LSF cluster.
Each array job basically uses GNU parallel, running something similar to 
rsync -avR. The “-R” option basically creates the directories as given.
Of course this worked because i was using the fast private network to transfer 
between the storage systems. Also i know that cp or tar might be better than 
rsync with respect to speed, but rsync was convenient and i could always start 
over again without checkpointing or remembering where i left off previously.
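
Stripped down, the pattern looks roughly like this - paths, hosts and the chunk 
count are placeholders, each chunk would really be one SGE/LSF array task 
rather than a loop, and file names are assumed not to contain newlines:

SRC_ROOT=/gpfs/source                         # placeholder source filesystem
DEST=transfer-node:/gpfs/destination          # placeholder target

# Split the tsreaddir-style path list into chunks without breaking lines.
split -a 3 -d -n l/500 all_paths.txt chunk.   # chunk.000 ... chunk.499

for CHUNK in chunk.*; do
    # Insert a "/./" marker so rsync -R recreates paths relative to SRC_ROOT;
    # parallel -m packs many paths into each rsync invocation.
    sed "s|^$SRC_ROOT/|$SRC_ROOT/./|" "$CHUNK" |
        parallel -j 8 -m rsync -aR {} "$DEST/"
done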

Similar to how Bill mentioned in the previous email, but i used gpfs util 
scripts and basic GNU parallel/rsync, SGE/LSF to submit jobs to the cluster as 
superuser. It used to work pretty well.

Since then - I constantly use parallel and rsync to copy large directories.

Thank you,
Lohit

On Mar 8, 2019, 7:43 AM -0600, William Abbott , wrote:
> We had a similar situation and ended up using parsyncfp, which generates
> multiple parallel rsyncs based on file lists. If they're on the same IB
> fabric (as ours were) you can use that instead of ethernet, and it
> worked pretty well. One caveat is that you need to follow the parallel
> transfers with a final single rsync, so you can use --delete.
>
> For the initial transfer you can also use bbcp. It can get very good
> performance but isn't nearly as convenient as rsync for subsequent
> transfers. The performance isn't good with small files but you can use
> tar on both ends to deal with that, in a similar way to what Uwe
> suggests below. The bbcp documentation outlines how to do that.
>
> Bill
>
> On 3/6/19 8:13 AM, Uwe Falke wrote:
> > Hi, in that case I'd open several tar pipes in parallel, maybe using
> > directories carefully selected, like
> >
> > tar -c <source_dirs> | ssh <target_host> "tar -x"
> >
> > I am not quite sure whether "-C /" for tar works here ("tar -C / -x"), but
> > along these lines might be a good efficient method. target_hosts should be
> > all nodes having the target file system mounted, and you should start
> > those pipes on the nodes with the source file system.
> > It is best to start with the largest directories, and use some
> > masterscript to start the tar pipes controlled by semaphores to not
> > overload anything.
> >
> >
> >
> > Mit freundlichen Grüßen / Kind regards
> >
> >
> > Dr. Uwe Falke
> >
> > IT Specialist
> > High Performance Computing Services / Integrated Technology Services /
> > Data Center Services
> > ---
> > IBM Deutschland
> > Rathausstr. 7
> > 09111 Chemnitz
> > Phone: +49 371 6978 2165
> > Mobile: +49 175 575 2877
> > E-Mail: uwefa...@de.ibm.com
> > ---
> > IBM Deutschland Business & Technology Services GmbH / Geschäftsführung:
> > Thomas Wolter, Sven Schooß
> > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
> > HRB 17122
> >
> >
> >
> >
> > From: "Oesterlin, Robert"  > To: gpfsug main discussion list  > Date: 06/03/2019 13:44
> > Subject: [gpfsug-discuss] Follow-up: migrating billions of files
> > Sent by: gpfsug-discuss-boun...@spectrumscale.org
> >
> >
> >
> > Some of you had questions to my original post. More information:
> >
> > Source:
> > - Files are straight GPFS/Posix - no extended NFSV4 ACLs
> > - A solution that requires $’s to be spent on software (ie, Aspera) isn’t
> > a very viable option
> > - Both source and target clusters are in the same DC
> > - Source is stand-alone NSD servers (bonded 10g-E) and 8gb FC SAN storage
> > - Approx 40 file systems, a few large ones with 300M-400M files each,
> > others smaller
> > - no independent file sets
> > - migration must pose minimal disruption to existing users
> >
> > Target architecture is a small number of file systems (2-3) on ESS with
> > independent filesets
> > - Target (ESS) will have multiple 40gb-E links on each NSD server (GS4)
> >
> > My current thinking is AFM with a pre-populate of the file space and
> > switch the clients over to have them pull data they need (most of the data
> > is older and 

Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share

2019-03-07 Thread valleru
Thanks a lot Andrew.

It does look promising, but it does not strike me immediately how this could 
solve the SMB export case where the user authenticates with an AD username but the 
GPFS files that are present are owned by an LDAP username.
Maybe you are saying that if I enable GPFS to use these scripts - then GPFS 
will map the AD username to the LDAP username?

I found this url too..

https://www.ibm.com/support/knowledgecenter/en/SSFKCN/com.ibm.cluster.gpfs.doc/gpfs_uid/uid_gpfs.html

I will give it a read, try to understand how to implement it, and get back if I 
have any more questions.

If this works, it should help me configure and use CES SMB. (Hopefully, CES 
file-based authentication will allow both SSH-key authentication for NFS and AD 
for SMB in the same CES cluster.)

Regards,
Lohit

On Mar 7, 2019, 4:52 PM -0600, Andrew Beattie , wrote:
> Lohit
>
> Have you looked at mmUIDtoName / mmNametoUID?
>
> Yes it will require some custom scripting on your behalf but it would be a 
> far more elegant solution and not run the risk of data corruption issues.
>
> There is at least one university on this mailing list that is doing exactly 
> what you are talking about, and they successfully use
> mmUIDtoName / mmNametoUID  to provide the relevant mapping between different 
> authentication environments - both internally in the university and 
> externally from other institutions.
>
> They use AFM to move data between different storage clusters, and mmUIDtoName 
> / mmNametoUID, to manage the ACL and permissions, they then move the data 
> from the AFM filesystem to the HPC scratch filesystem for processing by the 
> HPC (different filesystems within the same cluster)
>
>
> Regards,
> Andrew Beattie
> File and Object Storage Technical Specialist - A/NZ
> IBM Systems - Storage
> Phone: 614-2133-7927
> E-mail: abeat...@au1.ibm.com
>
>
> > - Original message -
> > From: vall...@cbio.mskcc.org
> > Sent by: gpfsug-discuss-boun...@spectrumscale.org
> > To: gpfsug-discuss@spectrumscale.org, gpfsug main discussion list 
> > 
> > Cc:
> > Subject: Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB 
> > share
> > Date: Fri, Mar 8, 2019 8:21 AM
> >
> > We have many current usernames from LDAP that do not exactly match with the 
> > usernames from AD.
> > Unfortunately, i guess CES SMB will need us to use either AD or LDAP or use 
> > the same usernames in both AD and LDAP.
> > I have been looking for a solution where could map the different usernames 
> > from LDAP and AD but have not found a solution. So exploring ways to do 
> > this from RHEL SMB.
> > I would appreciate if you have any solution to this issue.
> >
> > As of now we use LDAP uids/gids and SSH keys for authentication to the HPC 
> > cluster.
> > We want to use CES SMB to export the same mounts which have LDAP 
> > usernames/uids/gids however because of different usernames in AD - it has 
> > become a challenge.
> > Even if we do find a solution to this, i want to be able to use AD 
> > authentication for SMB and ssh key authentication for NFS.
> >
> > The above are the reasons we are just using CES with NFS and user defined 
> > authentication for users to have access with login through ssh keys.
> >
> > Regards,
> > Lohit
> >
> > On Mar 7, 2019, 3:12 PM -0600, Andrew Beattie , wrote:
> > > That would not be supported
> > >
> > > You shouldn't publish a remote mount Protocol cluster , and then connect 
> > > a native client to that cluster and create a non CES protocol export
> > > if you are going to use a Protocol cluster that's how you present your 
> > > protocols.
> > > otherwise don't set up the remote mount cluster.
> > >
> > > Why are you trying to publish a non HA RHEL SMB share instead of using 
> > > the HA CES protocols?
> > > Andrew Beattie
> > > File and Object Storage Technical Specialist - A/NZ
> > > IBM Systems - Storage
> > > Phone: 614-2133-7927
> > > E-mail: abeat...@au1.ibm.com
> > >
> > >
> > > > - Original message -
> > > > From: vall...@cbio.mskcc.org
> > > > Sent by: gpfsug-discuss-boun...@spectrumscale.org
> > > > To: gpfsug-discuss@spectrumscale.org, gpfsug main discussion list 
> > > > 
> > > > Cc:
> > > > Subject: Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces 
> > > > SMB share
> > > > Date: Fri, Mar 8, 2019 7:05 AM
> > > >
> > > > Thank you Andrew.
> > > >
> > > > However, we are not using SMB from the CES cluster but instead running 
> > > > a Redhat based SMB on a GPFS client of the CES cluster and exporting it 
> > > > from the GPFS client.
> > > > Is the above supported, and not known to cause any issues?
> > > >
> > > > Regards,
> > > > Lohit
> > > >
> > > > On Mar 7, 2019, 2:45 PM -0600, Andrew Beattie , 
> > > > wrote:
> > > > >
> > > > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_configprotocolsonremotefs.htm
> > > > ___
> > > > gpfsug-discuss mailing list
> > > > 

Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share

2019-03-07 Thread valleru
We have many current usernames from LDAP that do not exactly match the 
usernames from AD.
Unfortunately, I guess CES SMB will need us to use either AD or LDAP, or use the 
same usernames in both AD and LDAP.
I have been looking for a solution where I could map the different usernames from 
LDAP and AD but have not found one, so I am exploring ways to do this from 
RHEL SMB.
I would appreciate it if you have any solution to this issue.

As of now we use LDAP UIDs/GIDs and SSH keys for authentication to the HPC 
cluster.
We want to use CES SMB to export the same mounts, which have LDAP 
usernames/UIDs/GIDs; however, because of the different usernames in AD, it has 
become a challenge.
Even if we do find a solution to this, I want to be able to use AD 
authentication for SMB and SSH-key authentication for NFS.

The above are the reasons we are just using CES with NFS and user defined 
authentication for users to have access with login through ssh keys.

Regards,
Lohit

On Mar 7, 2019, 3:12 PM -0600, Andrew Beattie , wrote:
> That would not be supported
>
> You shouldn't publish a remote mount Protocol cluster , and then connect a 
> native client to that cluster and create a non CES protocol export
> if you are going to use a Protocol cluster that's how you present your 
> protocols.
> otherwise don't set up the remote mount cluster.
>
> Why are you trying to publish a non HA RHEL SMB share instead of using the HA 
> CES protocols?
> Andrew Beattie
> File and Object Storage Technical Specialist - A/NZ
> IBM Systems - Storage
> Phone: 614-2133-7927
> E-mail: abeat...@au1.ibm.com
>
>
> > - Original message -
> > From: vall...@cbio.mskcc.org
> > Sent by: gpfsug-discuss-boun...@spectrumscale.org
> > To: gpfsug-discuss@spectrumscale.org, gpfsug main discussion list 
> > 
> > Cc:
> > Subject: Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB 
> > share
> > Date: Fri, Mar 8, 2019 7:05 AM
> >
> > Thank you Andrew.
> >
> > However, we are not using SMB from the CES cluster but instead running a 
> > Redhat based SMB on a GPFS client of the CES cluster and exporting it from 
> > the GPFS client.
> > Is the above supported, and not known to cause any issues?
> >
> > Regards,
> > Lohit
> >
> > On Mar 7, 2019, 2:45 PM -0600, Andrew Beattie , wrote:
> > >
> > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_configprotocolsonremotefs.htm
> > ___
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share

2019-03-07 Thread valleru
Thank you Andrew.

However, we are not using SMB from the CES cluster but instead running a Redhat 
based SMB on a GPFS client of the CES cluster and exporting it from the GPFS 
client.
Is the above supported, and not known to cause any issues?

Regards,
Lohit

On Mar 7, 2019, 2:45 PM -0600, Andrew Beattie , wrote:
>
> https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_configprotocolsonremotefs.htm
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share

2019-03-07 Thread valleru
Hello All,

We are thinking of exporting “remote” GPFS mounts on a remote GPFS 5.0 cluster 
through an SMB share.

I have heard in a previous thread that it is not a good idea to export an NFS/SMB 
share on a remote GPFS mount and make it writable.

The issue that could be caused by making it writable would be metanode swapping 
between the GPFS clusters.

May I understand this better, and the seriousness of this issue?

The possibility of a single file being written at the same time from a GPFS 
node and an NFS/SMB node is minimal - however, it is possible that a file is 
written at the same time from multiple protocols by mistake, and we cannot 
prevent it.

This is the setup:

GPFS storage cluster: /gpfs01
GPFS CES cluster ( does not have any storage) : /gpfs01 -> mounted remotely . 
NFS export /gpfs01 as part of CES cluster
GPFS client for CES cluster -> Acts as SMB server and exports /gpfs01 over SMB

Are there any other limitations that i need to know for the above setup?

We cannot use GPFS CES SMB as of now for a few other reasons, such as LDAP/AD ID 
mapping and authentication complications.

Regards,
Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2

2018-11-02 Thread valleru
Also - you could just upgrade one of the clients to this version, and test to 
see if the hang still occurs.
You do not have to upgrade the NSD servers to test.
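
For anyone comparing against their own setup, a quick way to see the mix of 
daemon level versus on-disk filesystem version described below (the filesystem 
name is a placeholder):

mmdiag --version                 # GPFS code level actually running on this node
mmlsconfig minReleaseLevel       # cluster-wide minimum release level
mmlsfs gpfs01 -V                 # on-disk filesystem format version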

Regards,
Lohit

On Nov 2, 2018, 12:29 PM -0400, vall...@cbio.mskcc.org, wrote:
> Yes,
>
> We have upgraded to 5.0.1-0.5, which has the patch for the issue.
> The related IBM case number was : TS001010674
>
> Regards,
> Lohit
>
> On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems 
> , wrote:
> > Hi Damir,
> >
> > It was related to specific user jobs and mmap (?). We opened PMR with IBM 
> > and have patch from IBM, since than we don’t see issue.
> >
> > Regards,
> >
> > Sveta.
> >
> > > On Nov 2, 2018, at 11:55 AM, Damir Krstic  wrote:
> > >
> > > Hi,
> > >
> > > Did you ever figure out the root cause of the issue? We have recently 
> > > (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64
> > >
> > > In the last few weeks we have seen an increasing number of ps hangs 
> > > across compute and login nodes on our cluster. The filesystem version (of 
> > > all filesystems on our cluster) is:
> > >  -V                 15.01 (4.2.0.0)          File system version
> > >
> > > I am just wondering if anyone has seen this type of issue since you first 
> > > reported it and if there is a known fix for it.
> > >
> > > Damir
> > >
> > > > On Tue, May 22, 2018 at 10:43 AM  wrote:
> > > > > Hello All,
> > > > >
> > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a 
> > > > > month ago. We have not yet converted the 4.2.2.2 filesystem version 
> > > > > to 5. ( That is we have not run the mmchconfig release=LATEST command)
> > > > > Right after the upgrade, we are seeing many “ps hangs" across the 
> > > > > cluster. All the “ps hangs” happen when jobs run related to a Java 
> > > > > process or many Java threads (example: GATK )
> > > > > The hangs are pretty random, and have no particular pattern except 
> > > > > that we know that it is related to just Java or some jobs reading 
> > > > > from directories with about 60 files.
> > > > >
> > > > > I have raised an IBM critical service request about a month ago 
> > > > > related to this - PMR: 24090,L6Q,000.
> > > > > However, According to the ticket  - they seemed to feel that it might 
> > > > > not be related to GPFS.
> > > > > Although, we are sure that these hangs started to appear only after 
> > > > > we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2.
> > > > >
> > > > > One of the other reasons we are not able to prove that it is GPFS is 
> > > > > because, we are unable to capture any logs/traces from GPFS once the 
> > > > > hang happens.
> > > > > Even GPFS trace commands hang, once “ps hangs” and thus it is getting 
> > > > > difficult to get any dumps from GPFS.
> > > > >
> > > > > Also - According to the IBM ticket, they seemed to have seen a “ps 
> > > > > hang” issue and we have to run the mmchconfig release=LATEST command, 
> > > > > and that will resolve the issue.
> > > > > However we are not comfortable making the permanent change to 
> > > > > Filesystem version 5. and since we don’t see any near solution to 
> > > > > these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the 
> > > > > previous state that we know the cluster was stable.
> > > > >
> > > > > Can downgrading GPFS take us back to exactly the previous GPFS config 
> > > > > state?
> > > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i 
> > > > > reinstall all rpms to a previous version? or is there anything else 
> > > > > that i need to make sure with respect to GPFS configuration?
> > > > > Because i think that GPFS 5.0 might have updated internal default 
> > > > > GPFS configuration parameters , and i am not sure if downgrading GPFS 
> > > > > will change them back to what they were in GPFS 4.2.3.2
> > > > >
> > > > > Our previous state:
> > > > >
> > > > > 2 Storage clusters - 4.2.3.2
> > > > > 1 Compute cluster - 4.2.3.2  ( remote mounts the above 2 storage 
> > > > > clusters )
> > > > >
> > > > > Our current state:
> > > > >
> > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2)
> > > > > 1 Compute cluster - 5.0.0.2
> > > > >
> > > > > Do i need to downgrade all the clusters to go to the previous state ? 
> > > > > or is it ok if we just downgrade the compute cluster to previous 
> > > > > version?
> > > > >
> > > > > Any advice on the best steps forward, would greatly help.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Lohit
> > > > > ___
> > > > > gpfsug-discuss mailing list
> > > > > gpfsug-discuss at spectrumscale.org
> > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > > ___
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> >
> > ___
> > gpfsug-discuss mailing list
> > 

Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2

2018-11-02 Thread valleru
Yes,

We have upgraded to 5.0.1-0.5, which has the patch for the issue.
The related IBM case number was : TS001010674

Regards,
Lohit

On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems 
, wrote:
> Hi Damir,
>
> It was related to specific user jobs and mmap (?). We opened PMR with IBM and 
> have patch from IBM, since than we don’t see issue.
>
> Regards,
>
> Sveta.
>
> > On Nov 2, 2018, at 11:55 AM, Damir Krstic  wrote:
> >
> > Hi,
> >
> > Did you ever figure out the root cause of the issue? We have recently (end 
> > of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64
> >
> > In the last few weeks we have seen an increasing number of ps hangs across 
> > compute and login nodes on our cluster. The filesystem version (of all 
> > filesystems on our cluster) is:
> >  -V                 15.01 (4.2.0.0)          File system version
> >
> > I am just wondering if anyone has seen this type of issue since you first 
> > reported it and if there is a known fix for it.
> >
> > Damir
> >
> > > On Tue, May 22, 2018 at 10:43 AM  wrote:
> > > > Hello All,
> > > >
> > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a 
> > > > month ago. We have not yet converted the 4.2.2.2 filesystem version to 
> > > > 5. ( That is we have not run the mmchconfig release=LATEST command)
> > > > Right after the upgrade, we are seeing many “ps hangs" across the 
> > > > cluster. All the “ps hangs” happen when jobs run related to a Java 
> > > > process or many Java threads (example: GATK )
> > > > The hangs are pretty random, and have no particular pattern except that 
> > > > we know that it is related to just Java or some jobs reading from 
> > > > directories with about 60 files.
> > > >
> > > > I have raised an IBM critical service request about a month ago related 
> > > > to this - PMR: 24090,L6Q,000.
> > > > However, According to the ticket  - they seemed to feel that it might 
> > > > not be related to GPFS.
> > > > Although, we are sure that these hangs started to appear only after we 
> > > > upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2.
> > > >
> > > > One of the other reasons we are not able to prove that it is GPFS is 
> > > > because, we are unable to capture any logs/traces from GPFS once the 
> > > > hang happens.
> > > > Even GPFS trace commands hang, once “ps hangs” and thus it is getting 
> > > > difficult to get any dumps from GPFS.
> > > >
> > > > Also - According to the IBM ticket, they seemed to have seen a “ps 
> > > > hang” issue and we have to run the mmchconfig release=LATEST command, and 
> > > > that will resolve the issue.
> > > > However we are not comfortable making the permanent change to 
> > > > Filesystem version 5. and since we don’t see any near solution to these 
> > > > hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous 
> > > > state that we know the cluster was stable.
> > > >
> > > > Can downgrading GPFS take us back to exactly the previous GPFS config 
> > > > state?
> > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i 
> > > > reinstall all rpms to a previous version? or is there anything else 
> > > > that i need to make sure with respect to GPFS configuration?
> > > > Because i think that GPFS 5.0 might have updated internal default GPFS 
> > > > configuration parameters , and i am not sure if downgrading GPFS will 
> > > > change them back to what they were in GPFS 4.2.3.2
> > > >
> > > > Our previous state:
> > > >
> > > > 2 Storage clusters - 4.2.3.2
> > > > 1 Compute cluster - 4.2.3.2  ( remote mounts the above 2 storage 
> > > > clusters )
> > > >
> > > > Our current state:
> > > >
> > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2)
> > > > 1 Compute cluster - 5.0.0.2
> > > >
> > > > Do i need to downgrade all the clusters to go to the previous state ? 
> > > > or is it ok if we just downgrade the compute cluster to previous 
> > > > version?
> > > >
> > > > Any advice on the best steps forward, would greatly help.
> > > >
> > > > Thanks,
> > > >
> > > > Lohit
> > > > ___
> > > > gpfsug-discuss mailing list
> > > > gpfsug-discuss at spectrumscale.org
> > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > ___
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size

2018-09-27 Thread valleru
> > > > > > > i could think of multiple other scenarios, which is why it's so 
> > > > > > > hard to accurately benchmark an application, because you will 
> > > > > > > design a benchmark to test an application, but it actually almost 
> > > > > > > always behaves differently than you think it does :-)
> > > > > > >
> > > > > > > so best is to run the real application and see under which 
> > > > > > > configuration it works best.
> > > > > > >
> > > > > > > you could also take a trace with trace=io and then look at
> > > > > > >
> > > > > > > TRACE_VNOP: READ:
> > > > > > > TRACE_VNOP: WRITE:
> > > > > > >
> > > > > > > and compare them to
> > > > > > >
> > > > > > > TRACE_IO: QIO: read
> > > > > > > TRACE_IO: QIO: write
> > > > > > >
> > > > > > > and see if the numbers summed up for both are somewhat equal. if 
> > > > > > > TRACE_VNOP is significant smaller than TRACE_IO you most likely 
> > > > > > > do more i/o than you should and turning prefetching off might 
> > > > > > > actually make things faster .
> > > > > > >
> > > > > > > keep in mind i am no longer working for IBM so all i say might be 
> > > > > > > obsolete by now, i no longer have access to the one and only 
> > > > > > > truth aka the source code ... but if i am wrong i am sure 
> > > > > > > somebody will point this out soon ;-)
> > > > > > >
> > > > > > > sven
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Tue, Sep 18, 2018 at 10:31 AM  wrote:
> > > > > > > > > Hello All,
> > > > > > > > >
> > > > > > > > > This is a continuation to the previous discussion that i had 
> > > > > > > > > with Sven.
> > > > > > > > > However against what i had mentioned previously - i realize 
> > > > > > > > > that this is “not” related to mmap, and i see it when doing 
> > > > > > > > > random freads.
> > > > > > > > >
> > > > > > > > > I see that block-size of the filesystem matters when reading 
> > > > > > > > > from Page pool.
> > > > > > > > > I see a major difference in performance when compared 1M to 
> > > > > > > > > 16M, when doing lot of random small freads with all of the 
> > > > > > > > > data in pagepool.
> > > > > > > > >
> > > > > > > > > Performance for 1M is a magnitude “more” than the performance 
> > > > > > > > > that i see for 16M.
> > > > > > > > >
> > > > > > > > > The GPFS that we have currently is :
> > > > > > > > > Version : 5.0.1-0.5
> > > > > > > > > Filesystem version: 19.01 (5.0.1.0)
> > > > > > > > > Block-size : 16M
> > > > > > > > >
> > > > > > > > > I had made the filesystem block-size to be 16M, thinking that 
> > > > > > > > > i would get the most performance for both random/sequential 
> > > > > > > > > reads from 16M than the smaller block-sizes.
> > > > > > > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 
> > > > > > > > > and thus not loose lot of storage space even with 16M.
> > > > > > > > > I had run few benchmarks and i did see that 16M was 
> > > > > > > > > performing better “when hitting storage/disks” with respect 
> > > > > > > > > to bandwidth for random/sequential on small/large reads.
> > > > > > > > >
> > > > > > > > > However, with this particular workload - where it freads a 
> > > > > > > > > chunk of data randomly from hundreds of files -> I see that 
> > > > > > > > > the number of page-faults increase with block-size and 

Re: [gpfsug-discuss] GPFS, Pagepool and Block size -> Performance reduces with larger block size

2018-09-19 Thread valleru
> the source code ... but if i am wrong i am sure somebody will point 
> > > > this out soon ;-)
> > > >
> > > > sven
> > > >
> > > >
> > > >
> > > >
> > > > > On Tue, Sep 18, 2018 at 10:31 AM  wrote:
> > > > > > Hello All,
> > > > > >
> > > > > > This is a continuation to the previous discussion that i had with 
> > > > > > Sven.
> > > > > > However against what i had mentioned previously - i realize that 
> > > > > > this is “not” related to mmap, and i see it when doing random 
> > > > > > freads.
> > > > > >
> > > > > > I see that block-size of the filesystem matters when reading from 
> > > > > > Page pool.
> > > > > > I see a major difference in performance when compared 1M to 16M, 
> > > > > > when doing lot of random small freads with all of the data in 
> > > > > > pagepool.
> > > > > >
> > > > > > Performance for 1M is a magnitude “more” than the performance that 
> > > > > > i see for 16M.
> > > > > >
> > > > > > The GPFS that we have currently is :
> > > > > > Version : 5.0.1-0.5
> > > > > > Filesystem version: 19.01 (5.0.1.0)
> > > > > > Block-size : 16M
> > > > > >
> > > > > > I had made the filesystem block-size to be 16M, thinking that i 
> > > > > > would get the most performance for both random/sequential reads 
> > > > > > from 16M than the smaller block-sizes.
> > > > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and 
> > > > > > thus not loose lot of storage space even with 16M.
> > > > > > I had run few benchmarks and i did see that 16M was performing 
> > > > > > better “when hitting storage/disks” with respect to bandwidth for 
> > > > > > random/sequential on small/large reads.
> > > > > >
> > > > > > However, with this particular workload - where it freads a chunk of 
> > > > > > data randomly from hundreds of files -> I see that the number of 
> > > > > > page-faults increase with block-size and actually reduce the 
> > > > > > performance.
> > > > > > 1M performs a lot better than 16M, and may be i will get better 
> > > > > > performance with less than 1M.
> > > > > > It gives the best performance when reading from local disk, with 4K 
> > > > > > block size filesystem.
> > > > > >
> > > > > > What i mean by performance when it comes to this workload - is not 
> > > > > > the bandwidth but the amount of time that it takes to do each 
> > > > > > iteration/read batch of data.
> > > > > >
> > > > > > I figure what is happening is:
> > > > > > fread is trying to read a full block size of 16M - which is good in 
> > > > > > a way, when it hits the hard disk.
> > > > > > But the application could be using just a small part of that 16M. 
> > > > > > Thus when randomly reading(freads) lot of data of 16M chunk size - 
> > > > > > it is page faulting a lot more and causing the performance to drop .
> > > > > > I could try to make the application do read instead of freads, but 
> > > > > > i fear that could be bad too since it might be hitting the disk 
> > > > > > with a very small block size and that is not good.
> > > > > >
> > > > > > With the way i see things now -
> > > > > > I believe it could be best if the application does random reads of 
> > > > > > 4k/1M from pagepool but some how does 16M from rotating disks.
> > > > > >
> > > > > > I don’t see any way of doing the above other than following a 
> > > > > > different approach where i create a filesystem with a smaller block 
> > > > > > size ( 1M or less than 1M ), on SSDs as a tier.
> > > > > >
> > > > > > May i please ask for advise, if what i am understanding/seeing is 
> > > > > > right and the best solution possible for the above scenario.
> > > > > >
> > > > > > Regards,
> > > > > > Lohit
> > > > > >

Re: [gpfsug-discuss] GPFS, Pagepool and Block size -> Performance reduces with larger block size

2018-09-19 Thread valleru
with less than 1M.
> > > It gives the best performance when reading from local disk, with 4K block 
> > > size filesystem.
> > >
> > > What i mean by performance when it comes to this workload - is not the 
> > > bandwidth but the amount of time that it takes to do each iteration/read 
> > > batch of data.
> > >
> > > I figure what is happening is:
> > > fread is trying to read a full block size of 16M - which is good in a 
> > > way, when it hits the hard disk.
> > > But the application could be using just a small part of that 16M. Thus 
> > > when randomly reading(freads) lot of data of 16M chunk size - it is page 
> > > faulting a lot more and causing the performance to drop .
> > > I could try to make the application do read instead of freads, but i fear 
> > > that could be bad too since it might be hitting the disk with a very 
> > > small block size and that is not good.
> > >
> > > With the way i see things now -
> > > I believe it could be best if the application does random reads of 4k/1M 
> > > from pagepool but some how does 16M from rotating disks.
> > >
> > > I don’t see any way of doing the above other than following a different 
> > > approach where i create a filesystem with a smaller block size ( 1M or 
> > > less than 1M ), on SSDs as a tier.
> > >
> > > May i please ask for advise, if what i am understanding/seeing is right 
> > > and the best solution possible for the above scenario.
> > >
> > > Regards,
> > > Lohit
> > >
> > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , 
> > > wrote:
> > > > Hey Sven,
> > > >
> > > > This is regarding mmap issues and GPFS.
> > > > We had discussed previously of experimenting with GPFS 5.
> > > >
> > > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2
> > > >
> > > > I am yet to experiment with mmap performance, but before that - I am 
> > > > seeing weird hangs with GPFS 5 and I think it could be related to mmap.
> > > >
> > > > Have you seen GPFS ever hang on this syscall?
> > > > [Tue Apr 10 04:20:13 2018] [] 
> > > > _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
> > > >
> > > > I see the above ,when kernel hangs and throws out a series of trace 
> > > > calls.
> > > >
> > > > I somehow think the above trace is related to processes hanging on GPFS 
> > > > forever. There are no errors in GPFS however.
> > > >
> > > > Also, I think the above happens only when the mmap threads go above a 
> > > > particular number.
> > > >
> > > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to 
> > > > 4.2.3.2 . At that time , the issue happened when mmap threads go more 
> > > > than worker1threads. According to the ticket - it was a mmap race 
> > > > condition that GPFS was not handling well.
> > > >
> > > > I am not sure if this issue is a repeat and I am yet to isolate the 
> > > > incident and test with increasing number of mmap threads.
> > > >
> > > > I am not 100 percent sure if this is related to mmap yet but just 
> > > > wanted to ask you if you have seen anything like above.
> > > >
> > > > Thanks,
> > > >
> > > > Lohit
> > > >
> > > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote:
> > > > > Hi Lohit,
> > > > >
> > > > > i am working with ray on a mmap performance improvement right now, 
> > > > > which most likely has the same root cause as yours , see -->  
> > > > > http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> > > > > the thread above is silent after a couple of back and rorth, but ray 
> > > > > and i have active communication in the background and will repost as 
> > > > > soon as there is something new to share.
> > > > > i am happy to look at this issue after we finish with ray's workload 
> > > > > if there is something missing, but first let's finish his, get you 
> > > > > try the same fix and see if there is something missing.
> > > > >
> > > > > btw. if people would share their use of MMAP , what applications they 
> > > > > use (home grown, just use lmdb which uses mmap under the cover, etc) 
> > > > >

Re: [gpfsug-discuss] GPFS, Pagepool and Block size -> Performance reduces with larger block size

2018-09-18 Thread valleru
Hello All,

This is a continuation of the previous discussion that I had with Sven.
However, contrary to what I had mentioned previously, I realize that this is “not”
related to mmap, and I see it when doing random freads.

I see that the block size of the filesystem matters when reading from the pagepool.
I see a major difference in performance when comparing 1M to 16M, when doing a lot
of random small freads with all of the data in the pagepool.

Performance for 1M is an order of magnitude better than the performance that I see
for 16M.

The GPFS version that we have currently is:
Version : 5.0.1-0.5
Filesystem version: 19.01 (5.0.1.0)
Block-size : 16M

I had made the filesystem block size 16M, thinking that I would get more
performance for both random and sequential reads from 16M than from the smaller
block sizes.
With GPFS 5.0, I made use of the 1024 sub-blocks instead of 32 and thus do not lose
a lot of storage space even with 16M.
I had run a few benchmarks and did see that 16M was performing better “when
hitting storage/disks” with respect to bandwidth for random/sequential,
small/large reads.

However, with this particular workload - where it freads a chunk of data
randomly from hundreds of files - I see that the number of page faults
increases with block size and actually reduces the performance.
1M performs a lot better than 16M, and maybe I will get better performance
with less than 1M.
It gives the best performance when reading from local disk with a 4K block-size
filesystem.

What I mean by performance when it comes to this workload is not the
bandwidth but the amount of time that it takes to do each iteration/read batch
of data.

I figure what is happening is:
fread triggers a read of a full 16M block - which is good in a way, when
it hits the hard disk.
But the application could be using just a small part of that 16M. Thus, when
randomly reading (freads) a lot of data in 16M chunks, it is page faulting a
lot more and causing the performance to drop.
I could try to make the application do read instead of freads, but I fear that
could be bad too, since it might then hit the disk with a very small block
size, and that is not good.
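
One thing I could do to confirm this is to compare the read sizes the application
actually issues with what GPFS does underneath. A rough sketch - the binary name,
arguments and paths below are only placeholders:

# Log the read()/pread() calls the application really issues for a short
# sample run (strace slows things down, so only for a small test):
strace -f -e trace=read,pread64 -o /tmp/app-reads.log ./my_app --input /gpfs/fs16m/data

# Average bytes returned per read call, from the strace log:
awk '/read/ && $NF ~ /^[0-9]+$/ {sum += $NF; n++} END {if (n) print n " reads, avg " sum/n " bytes"}' /tmp/app-reads.log

# For comparison, the I/O sizes GPFS itself issued recently on this node:
/usr/lpp/mmfs/bin/mmdiag --iohist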

The way I see things now -
I believe it would be best if the application did random reads of 4K/1M from the
pagepool but somehow did 16M reads from the rotating disks.

I don’t see any way of doing the above other than following a different
approach where I create a filesystem with a smaller block size (1M or less
than 1M) on SSDs as a tier.

May I please ask for advice on whether what I am understanding/seeing is right,
and on the best solution possible for the above scenario.

Regards,
Lohit

On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , wrote:
> Hey Sven,
>
> This is regarding mmap issues and GPFS.
> We had discussed previously of experimenting with GPFS 5.
>
> I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2
>
> I am yet to experiment with mmap performance, but before that - I am seeing 
> weird hangs with GPFS 5 and I think it could be related to mmap.
>
> Have you seen GPFS ever hang on this syscall?
> [Tue Apr 10 04:20:13 2018] [] 
> _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
>
> I see the above ,when kernel hangs and throws out a series of trace calls.
>
> I somehow think the above trace is related to processes hanging on GPFS 
> forever. There are no errors in GPFS however.
>
> Also, I think the above happens only when the mmap threads go above a 
> particular number.
>
> We had faced a similar issue in 4.2.3 and it was resolved in a patch to 
> 4.2.3.2 . At that time , the issue happened when mmap threads go more than 
> worker1threads. According to the ticket - it was a mmap race condition that 
> GPFS was not handling well.
>
> I am not sure if this issue is a repeat and I am yet to isolate the incident 
> and test with increasing number of mmap threads.
>
> I am not 100 percent sure if this is related to mmap yet but just wanted to 
> ask you if you have seen anything like above.
>
> Thanks,
>
> Lohit
>
> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote:
> > Hi Lohit,
> >
> > i am working with ray on a mmap performance improvement right now, which 
> > most likely has the same root cause as yours , see -->  
> > http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> > the thread above is silent after a couple of back and rorth, but ray and i 
> > have active communication in the background and will repost as soon as 
> > there is something new to share.
> > i am happy to look at this issue after we finish with ray's workload if 
> > there is something missing, but first let's finish his, get you try the 
> > same fix and see if there is something missing.
> >
> > btw. if people would share their use of MMAP , what applicatio

Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2

2018-05-22 Thread valleru
Thanks, Dwayne.
I don’t think we are facing anything else from a network perspective as of now.
We were seeing deadlocks initially when we upgraded to 5.0, but they might not be
because of the network.
We also see deadlocks now, but they are mostly caused by high waiters, I
believe. I have temporarily disabled automated deadlock detection.
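
For what it is worth, this is roughly how I have been looking at the waiters and
at deadlock detection - a sketch only, and the threshold value shown is just an
example, not a recommendation:

# Show the current long waiters on a node:
/usr/lpp/mmfs/bin/mmdiag --waiters

# Show the state of the automated deadlock detection:
/usr/lpp/mmfs/bin/mmdiag --deadlock

# Deadlock detection is controlled by this parameter; 0 disables it and a
# non-zero value (in seconds) re-enables it:
/usr/lpp/mmfs/bin/mmlsconfig deadlockDetectionThreshold
/usr/lpp/mmfs/bin/mmchconfig deadlockDetectionThreshold=300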

Thanks,
Lohit

On May 22, 2018, 12:54 PM -0400, dwayne.h...@med.mun.ca, wrote:
> We are having issues with ESS/Mellanox implementation and were curious as to 
> what you were working with from a network perspective.
>
> Best,
> Dwayne
> —
> Dwayne Hart | Systems Administrator IV
>
> CHIA, Faculty of Medicine
> Memorial University of Newfoundland
> 300 Prince Philip Drive
> St. John’s, Newfoundland | A1B 3V6
> Craig L Dobbin Building | 4M409
> T 709 864 6631
>
> On May 22, 2018, at 2:10 PM, "vall...@cbio.mskcc.org" 
>  wrote:
>
> > 10G Ethernet.
> >
> > Thanks,
> > Lohit
> >
> > On May 22, 2018, 11:55 AM -0400, dwayne.h...@med.mun.ca, wrote:
> > > Hi Lohit,
> > >
> > > What type of network are you using on the back end to transfer the GPFS 
> > > traffic?
> > >
> > > Best,
> > > Dwayne
> > >
> > > From: gpfsug-discuss-boun...@spectrumscale.org 
> > > [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of 
> > > vall...@cbio.mskcc.org
> > > Sent: Tuesday, May 22, 2018 1:13 PM
> > > To: gpfsug main discussion list 
> > > Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading 
> > > from GPFS 5.0.0-2 to GPFS 4.2.3.2
> > >
> > > Hello All,
> > >
> > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month 
> > > ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( 
> > > That is we have not run the mmchconfig release=LATEST command)
> > > Right after the upgrade, we are seeing many “ps hangs" across the 
> > > cluster. All the “ps hangs” happen when jobs run related to a Java 
> > > process or many Java threads (example: GATK )
> > > The hangs are pretty random, and have no particular pattern except that 
> > > we know that it is related to just Java or some jobs reading from 
> > > directories with about 60 files.
> > >
> > > I have raised an IBM critical service request about a month ago related 
> > > to this - PMR: 24090,L6Q,000.
> > > However, According to the ticket  - they seemed to feel that it might not 
> > > be related to GPFS.
> > > Although, we are sure that these hangs started to appear only after we 
> > > upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2.
> > >
> > > One of the other reasons we are not able to prove that it is GPFS is 
> > > because, we are unable to capture any logs/traces from GPFS once the hang 
> > > happens.
> > > Even GPFS trace commands hang, once “ps hangs” and thus it is getting 
> > > difficult to get any dumps from GPFS.
> > >
> > > Also  - According to the IBM ticket, they seemed to have a seen a “ps 
> > > hang" issue and we have to run  mmchconfig release=LATEST command, and 
> > > that will resolve the issue.
> > > However we are not comfortable making the permanent change to Filesystem 
> > > version 5. and since we don’t see any near solution to these hangs - we 
> > > are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we 
> > > know the cluster was stable.
> > >
> > > Can downgrading GPFS take us back to exactly the previous GPFS config 
> > > state?
> > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i 
> > > reinstall all rpms to a previous version? or is there anything else that 
> > > i need to make sure with respect to GPFS configuration?
> > > Because i think that GPFS 5.0 might have updated internal default GPFS 
> > > configuration parameters , and i am not sure if downgrading GPFS will 
> > > change them back to what they were in GPFS 4.2.3.2
> > >
> > > Our previous state:
> > >
> > > 2 Storage clusters - 4.2.3.2
> > > 1 Compute cluster - 4.2.3.2  ( remote mounts the above 2 storage clusters 
> > > )
> > >
> > > Our current state:
> > >
> > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2)
> > > 1 Compute cluster - 5.0.0.2
> > >
> > > Do i need to downgrade all the clusters to go to the previous state ? or 
> > > is it ok if we just downgrade the compute cluster to previous version?
> > >
> > > Any advice on the best steps forward, would greatly help.
> > >
> > > Thanks,
> > >
> > > Lohit
> > > ___
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > ___
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___

Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2

2018-05-22 Thread valleru
10G Ethernet.

Thanks,
Lohit

On May 22, 2018, 11:55 AM -0400, dwayne.h...@med.mun.ca, wrote:
> Hi Lohit,
>
> What type of network are you using on the back end to transfer the GPFS 
> traffic?
>
> Best,
> Dwayne
>
> From: gpfsug-discuss-boun...@spectrumscale.org 
> [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of 
> vall...@cbio.mskcc.org
> Sent: Tuesday, May 22, 2018 1:13 PM
> To: gpfsug main discussion list 
> Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading 
> from GPFS 5.0.0-2 to GPFS 4.2.3.2
>
> Hello All,
>
> We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month 
> ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is 
> we have not run the mmchconfig release=LATEST command)
> Right after the upgrade, we are seeing many “ps hangs" across the cluster. 
> All the “ps hangs” happen when jobs run related to a Java process or many 
> Java threads (example: GATK )
> The hangs are pretty random, and have no particular pattern except that we 
> know that it is related to just Java or some jobs reading from directories 
> with about 60 files.
>
> I have raised an IBM critical service request about a month ago related to 
> this - PMR: 24090,L6Q,000.
> However, According to the ticket  - they seemed to feel that it might not be 
> related to GPFS.
> Although, we are sure that these hangs started to appear only after we 
> upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2.
>
> One of the other reasons we are not able to prove that it is GPFS is because, 
> we are unable to capture any logs/traces from GPFS once the hang happens.
> Even GPFS trace commands hang, once “ps hangs” and thus it is getting 
> difficult to get any dumps from GPFS.
>
> Also  - According to the IBM ticket, they seemed to have a seen a “ps hang" 
> issue and we have to run  mmchconfig release=LATEST command, and that will 
> resolve the issue.
> However we are not comfortable making the permanent change to Filesystem 
> version 5. and since we don’t see any near solution to these hangs - we are 
> thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know 
> the cluster was stable.
>
> Can downgrading GPFS take us back to exactly the previous GPFS config state?
> With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall 
> all rpms to a previous version? or is there anything else that i need to make 
> sure with respect to GPFS configuration?
> Because i think that GPFS 5.0 might have updated internal default GPFS 
> configuration parameters , and i am not sure if downgrading GPFS will change 
> them back to what they were in GPFS 4.2.3.2
>
> Our previous state:
>
> 2 Storage clusters - 4.2.3.2
> 1 Compute cluster - 4.2.3.2  ( remote mounts the above 2 storage clusters )
>
> Our current state:
>
> 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2)
> 1 Compute cluster - 5.0.0.2
>
> Do i need to downgrade all the clusters to go to the previous state ? or is 
> it ok if we just downgrade the compute cluster to previous version?
>
> Any advice on the best steps forward, would greatly help.
>
> Thanks,
>
> Lohit
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2

2018-05-22 Thread valleru
Hello All,

We recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2, about a month ago.
We have not yet converted the 4.2.2.2 filesystem version to 5 (that is, we
have not run the mmchconfig release=LATEST command).
Right after the upgrade, we are seeing many “ps hangs” across the cluster. All
the “ps hangs” happen when jobs run that are related to a Java process or many Java
threads (example: GATK).
The hangs are pretty random and have no particular pattern, except that we know
they are related to just Java, or to some jobs reading from directories with
about 60 files.

I raised an IBM critical service request about a month ago related to this -
PMR: 24090,L6Q,000.
However, according to the ticket, they seemed to feel that it might not be
related to GPFS,
although we are sure that these hangs started to appear only after we upgraded
GPFS to 5.0.0.2 from 4.2.3.2.

One of the other reasons we are not able to prove that it is GPFS is that
we are unable to capture any logs/traces from GPFS once the hang happens.
Even the GPFS trace commands hang once “ps hangs”, and thus it is getting difficult
to get any dumps from GPFS.
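
Even when the GPFS tracing is stuck, the kernel side can still be inspected with
standard Linux tools. A rough sketch of what I can try to collect next time - the
PID is a placeholder, and the sysrq part assumes sysrq is enabled:

# Kernel stack of one hung process (replace <pid> with the PID of a hung ps/java):
cat /proc/<pid>/stack

# List processes stuck in uninterruptible sleep and their wait channels:
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# Dump the stacks of all blocked tasks into the kernel log and save it:
echo w > /proc/sysrq-trigger
dmesg -T > /tmp/blocked-tasks.$(hostname).log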

Also, according to the IBM ticket, they seem to have seen a “ps hang”
issue before, and we have to run the mmchconfig release=LATEST command, which
will resolve the issue.
However, we are not comfortable making the permanent change to filesystem
version 5, and since we don’t see any near-term solution to these hangs, we are
thinking of downgrading to GPFS 4.2.3.2, the previous state in which we know the
cluster was stable.

Can downgrading GPFS take us back to exactly the previous GPFS config state?
With respect to downgrading from 5 to 4.2.3.2 - is it just a matter of reinstalling
all RPMs at the previous version, or is there anything else that I need to check
with respect to the GPFS configuration?
I ask because I think that GPFS 5.0 might have updated internal default GPFS
configuration parameters, and I am not sure whether downgrading GPFS will change
them back to what they were in GPFS 4.2.3.2.
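
Before attempting a downgrade, I plan to capture the current configuration and the
committed levels so they can be compared afterwards. A rough sketch - the device
name gpfs01 is a placeholder:

# Save the full cluster configuration for later comparison:
/usr/lpp/mmfs/bin/mmlsconfig > /root/mmlsconfig.$(date +%Y%m%d).txt

# The release level the cluster is committed to (should still show 4.2.x as
# long as "mmchconfig release=LATEST" has not been run):
/usr/lpp/mmfs/bin/mmlsconfig minReleaseLevel

# The on-disk filesystem version (should still show the 4.2.2.x format):
/usr/lpp/mmfs/bin/mmlsfs gpfs01 -V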

Our previous state:

2 Storage clusters - 4.2.3.2
1 Compute cluster - 4.2.3.2 (remote mounts the above 2 storage clusters)

Our current state:

2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2)
1 Compute cluster - 5.0.0.2

Do I need to downgrade all the clusters to go back to the previous state, or is it
OK if we just downgrade the compute cluster to the previous version?

Any advice on the best steps forward, would greatly help.

Thanks,

Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks

2018-05-15 Thread valleru
Thank you for the detailed answer, Andrew.
I do understand that anything above the POSIX level will not be supported by
IBM and might lead to scaling or other issues.
We will start small and discuss any other possible efforts with our IBM
representative.

Regards,
Lohit

On May 15, 2018, 10:39 PM -0400, Andrew Beattie <abeat...@au1.ibm.com>, wrote:
> Lohit,
>
> There is no technical reason why if you use the correct licensing that you 
> can't publish a Posix fileystem using external Protocol tool rather than CES
> the key thing to note is that if its not the IBM certified solution that IBM 
> support stops at the Posix level and the protocol issues are your own to 
> resolve.
>
> The reason we provide the CES environment is to provide a supported 
> architecture to deliver protocol access,  does it have some limitations - 
> certainly
> but it is a supported environment.  Moving away from this moves the risk onto 
> the customer to resolve and maintain.
>
> The other part of this, and potentially the reason why you might have been 
> warned off using an external solution is that not all systems provide 
> scalability and resiliency
> so you may end up bumping into scaling issues by building your own 
> environment --- and from the sound of things this is a large complex 
> environment.  These issues are clearly defined in the CES stack and are well 
> understood.  moving away from this will move you into the realm of the 
> unknown -- again the risk becomes yours.
>
> it may well be worth putting a request in with your local IBM representative 
> to have IBM Scale protocol development team involved in your design and see 
> what we can support for your requirements.
>
>
> Regards,
> Andrew Beattie
> Software Defined Storage  - IT Specialist
> Phone: 614-2133-7927
> E-mail: abeat...@au1.ibm.com
>
>
> > - Original message -
> > From: vall...@cbio.mskcc.org
> > Sent by: gpfsug-discuss-boun...@spectrumscale.org
> > To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> > Cc:
> > Subject: Re: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks
> > Date: Wed, May 16, 2018 12:25 PM
> >
> > Thanks Stephen,
> >
> > Yes i do acknowledge, that it will need a SERVER license and thank you for 
> > reminding me.
> >
> > I just wanted to make sure, from the technical point of view that we won’t 
> > face any issues by exporting a GPFS mount as a SMB export.
> >
> > I remember, i had seen in documentation about few years ago that it is not 
> > recommended to export a GPFS mount via Third party SMB services (not CES). 
> > But i don’t exactly remember why.
> >
> > Regards,
> > Lohit
> >
> > On May 15, 2018, 10:19 PM -0400, Stephen Ulmer <ul...@ulmer.org>, wrote:
> > > Lohit,
> > >
> > > Just be aware that exporting the data from GPFS via SMB requires a SERVER 
> > > license for the node in question. You’ve mentioned client a few times 
> > > now. :)
> > >
> > > --
> > > Stephen
> > >
> > >
> > >
> > > > On May 15, 2018, at 6:48 PM, Lohit Valleru <vall...@cbio.mskcc.org> 
> > > > wrote:
> > > >
> > > > Thanks Christof.
> > > >
> > > > The usecase is just that : it is easier to have symlinks of files/dirs 
> > > > from various locations/filesystems rather than copying or duplicating 
> > > > that data.
> > > >
> > > > The design from many years was maintaining about 8 PB of NFS filesystem 
> > > > with thousands of symlinks to various locations and the same 
> > > > directories being exported on SMB.
> > > >
> > > > Now we are migrating most of the data to GPFS keeping the symlinks as 
> > > > they are.
> > > > Thus the need to follow symlinks from the GPFS filesystem to the NFS 
> > > > Filesystem.
> > > > The client wants to effectively use the symlinks design that works when 
> > > > used on Linux but is not happy to hear that he will have to redo years 
> > > > of work just because GPFS does not support the same.
> > > >
> > > > I understand that there might be a reason on why CES might not support 
> > > > this, but is it an issue if we run SMB server on the GPFS clients to 
> > > > expose a read only or read write GPFS mounts?
> > > >
> > > > Regards,
> > > >
> > > > Lohit
> > > >
> > > > On May 15, 2018, 6:32 PM -0400, Christof Schm

Re: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks

2018-05-15 Thread valleru
Thanks Stephen,

Yes, I do acknowledge that it will need a SERVER license, and thank you for
reminding me.

I just wanted to make sure, from a technical point of view, that we won’t face
any issues by exporting a GPFS mount as an SMB export.

I remember seeing in the documentation a few years ago that it is not
recommended to export a GPFS mount via third-party SMB services (not CES), but
I don’t exactly remember why.

Regards,
Lohit

On May 15, 2018, 10:19 PM -0400, Stephen Ulmer <ul...@ulmer.org>, wrote:
> Lohit,
>
> Just be aware that exporting the data from GPFS via SMB requires a SERVER 
> license for the node in question. You’ve mentioned client a few times now. :)
>
> --
> Stephen
>
>
>
> > On May 15, 2018, at 6:48 PM, Lohit Valleru <vall...@cbio.mskcc.org> wrote:
> >
> > Thanks Christof.
> >
> > The usecase is just that : it is easier to have symlinks of files/dirs from 
> > various locations/filesystems rather than copying or duplicating that data.
> >
> > The design from many years was maintaining about 8 PB of NFS filesystem 
> > with thousands of symlinks to various locations and the same directories 
> > being exported on SMB.
> >
> > Now we are migrating most of the data to GPFS keeping the symlinks as they 
> > are.
> > Thus the need to follow symlinks from the GPFS filesystem to the NFS 
> > Filesystem.
> > The client wants to effectively use the symlinks design that works when 
> > used on Linux but is not happy to hear that he will have to redo years of 
> > work just because GPFS does not support the same.
> >
> > I understand that there might be a reason on why CES might not support 
> > this, but is it an issue if we run SMB server on the GPFS clients to expose 
> > a read only or read write GPFS mounts?
> >
> > Regards,
> >
> > Lohit
> >
> > On May 15, 2018, 6:32 PM -0400, Christof Schmitt 
> > <christof.schm...@us.ibm.com>, wrote:
> > > > I could use CES, but CES does not support follow-symlinks outside 
> > > > respective SMB export.
> > >
> > > Samba has the 'wide links' option, that we currently do not test and 
> > > support as part of the mmsmb integration. You can always open a RFE and 
> > > ask that we support this option in a future release.
> > >
> > > > Follow-symlinks is a however a hard-requirement  for to follow links 
> > > > outside GPFS filesystems.
> > >
> > > I might be reading this wrong, but do you actually want symlinks that 
> > > point to a file or directory outside of the GPFS file system? Could you 
> > > outline a usecase for that?
> > >
> > > Regards,
> > >
> > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ
> > > christof.schm...@us.ibm.com  ||  +1-520-799-2469    (T/L: 321-2469)
> > >
> > >
> > > > - Original message -
> > > > From: vall...@cbio.mskcc.org
> > > > Sent by: gpfsug-discuss-boun...@spectrumscale.org
> > > > To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> > > > Cc:
> > > > Subject: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks
> > > > Date: Tue, May 15, 2018 3:04 PM
> > > >
> > > > Hello All,
> > > >
> > > > Has anyone tried serving SMB export of GPFS mounts from a SMB server on 
> > > > GPFS client? Is it supported and does it lead to any issues?
> > > > I understand that i will not need a redundant SMB server configuration.
> > > >
> > > > I could use CES, but CES does not support follow-symlinks outside 
> > > > respective SMB export. Follow-symlinks is a however a hard-requirement  
> > > > for to follow links outside GPFS filesystems.
> > > >
> > > > Thanks,
> > > > Lohit
> > > >
> > > >
> > > > ___
> > > > gpfsug-discuss mailing list
> > > > gpfsug-discuss at spectrumscale.org
> > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > >
> > >
> > > ___
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > ___
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks

2018-05-15 Thread Lohit Valleru
Thanks Christof.

The use case is just this: it is easier to have symlinks to files/dirs in
various locations/filesystems rather than copying or duplicating that data.

The design for many years was to maintain about 8 PB of NFS filesystem with
thousands of symlinks to various locations, with the same directories being
exported over SMB.

Now we are migrating most of the data to GPFS, keeping the symlinks as they are.
Hence the need to follow symlinks from the GPFS filesystem to the NFS filesystem.
The client wants to keep using the symlink design that works when used directly
on Linux, and is not happy to hear that he will have to redo years of work just
because GPFS does not support the same.

I understand that there might be a reason why CES does not support this, but
is it an issue if we run an SMB server on the GPFS clients to expose read-only
or read-write GPFS mounts?

Regards,

Lohit

On May 15, 2018, 6:32 PM -0400, Christof Schmitt , 
wrote:
> > I could use CES, but CES does not support follow-symlinks outside 
> > respective SMB export.
>
> Samba has the 'wide links' option, that we currently do not test and support 
> as part of the mmsmb integration. You can always open a RFE and ask that we 
> support this option in a future release.
>
> > Follow-symlinks is a however a hard-requirement  for to follow links 
> > outside GPFS filesystems.
>
> I might be reading this wrong, but do you actually want symlinks that point 
> to a file or directory outside of the GPFS file system? Could you outline a 
> usecase for that?
>
> Regards,
>
> Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ
> christof.schm...@us.ibm.com  ||  +1-520-799-2469    (T/L: 321-2469)
>
>
> > - Original message -
> > From: vall...@cbio.mskcc.org
> > Sent by: gpfsug-discuss-boun...@spectrumscale.org
> > To: gpfsug main discussion list 
> > Cc:
> > Subject: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks
> > Date: Tue, May 15, 2018 3:04 PM
> >
> > Hello All,
> >
> > Has anyone tried serving SMB export of GPFS mounts from a SMB server on 
> > GPFS client? Is it supported and does it lead to any issues?
> > I understand that i will not need a redundant SMB server configuration.
> >
> > I could use CES, but CES does not support follow-symlinks outside 
> > respective SMB export. Follow-symlinks is a however a hard-requirement  for 
> > to follow links outside GPFS filesystems.
> >
> > Thanks,
> > Lohit
> >
> >
> > ___
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] SMB server on GPFS clients and Followsymlinks

2018-05-15 Thread valleru
Hello All,

Has anyone tried serving an SMB export of GPFS mounts from an SMB server on a GPFS
client? Is it supported, and does it lead to any issues?
I understand that I will not need a redundant SMB server configuration.

I could use CES, but CES does not support following symlinks outside the respective
SMB export. Following symlinks is, however, a hard requirement, in order to follow
links outside the GPFS filesystems.
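
For reference, on a plain (non-CES) Samba server the behaviour I am after would
normally be enabled per share with settings along these lines - an untested
sketch, where the share name and path are placeholders, and note that wide links
only take effect when unix extensions are disabled:

[global]
    unix extensions = no

[gpfsshare]
    path = /gpfs/fs0/export
    follow symlinks = yes
    wide links = yes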

Thanks,
Lohit

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Spectrum Scale CES , SAMBA and AD keytab integration with userdefined authentication

2018-05-03 Thread valleru
Hello All,

I am trying to export a single remote filesystem over NFS/SMB using GPFS CES
(GPFS 5.0.0.2 and CentOS 7).

We need the NFS exports to be accessible on client nodes that use public-key
authentication and LDAP authorization. I already have this working with a
previous CES setup using user-defined authentication, where users can just log in
to the client nodes and access the NFS mounts.

However, I will also need Samba exports for the same GPFS filesystem with
AD/Kerberos authentication.
Previously, we had a working Samba export for a local filesystem using
SSSD and AD integration with Samba, as described in the Red Hat solution
below.
https://access.redhat.com/solutions/2221561
We find the above a cleaner solution for AD and Samba integration
compared to Centrify or winbind.

I understand that GPFS does offer AD authentication; however, I believe I cannot
use it, since NFS will need user-defined authentication while Samba will
need AD authentication.

I have thus been trying to use user-defined authentication.
I tried to edit the Samba configuration used by GPFS (with a bit of help from
this blog written by Simon:
https://www.roamingzebra.co.uk/2015/07/smb-protocol-support-with-spectrum.html)

/usr/lpp/mmfs/bin/net conf list

realm = 
workgroup = 
security = ads
kerberos method = secrets and keytab
idmap config * : backend = tdb
template homedir = /home/%U
dedicated keytab file = /etc/krb5.keytab

I had joined the node to AD with realmd, and I do get the relevant AD info when I
run:
/usr/lpp/mmfs/bin/net ads info

However, when I try to display the keytab or add principals to it, it just does
not work:
/usr/lpp/mmfs/bin/net ads keytab list -> does not show the keys present in
/etc/krb5.keytab.
/usr/lpp/mmfs/bin/net ads keytab add cifs -> does not add the keys to
/etc/krb5.keytab.

As per the Samba documentation, these two parameters should let Samba
automatically find the keytab file:
kerberos method = secrets and keytab
dedicated keytab file = /etc/krb5.keytab
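
A few checks I still plan to run to narrow this down - a rough sketch only; the
share and server names are placeholders, and the CES-bundled binaries under
/usr/lpp/mmfs/bin may behave differently from the distribution ones:

# Confirm what is actually in the keytab, independently of Samba:
klist -k /etc/krb5.keytab

# Verify the machine-account join from Samba's point of view:
/usr/lpp/mmfs/bin/net ads testjoin

# Re-run the failing keytab operation with more debug output:
/usr/lpp/mmfs/bin/net ads keytab list -d 5

# Once a share exists, test Kerberos authentication end to end:
smbclient -k //ces-node/share -c 'ls'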

I have not yet tried to see whether a Samba export works with AD
authentication, but I am afraid it might not.

Has anyone tried AD integration with SSSD/Samba for GPFS? Any
suggestions on how to debug the above would be really helpful.

Thanks,
Lohit

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts

2018-05-03 Thread valleru
Thanks, Simon.
Currently, we are thinking of using the same remote filesystem for both the NFS
and SMB exports.
I do have a related question with respect to SMB and AD integration on
user-defined authentication.
I have seen a past discussion from you on the user group regarding a similar
integration, but I am trying a different setup.
I will send an email with the related subject.

Thanks,
Lohit

On May 3, 2018, 1:30 PM -0400, Simon Thompson (IT Research Support) 
, wrote:
> Yes we do this when we really really need to take a remote FS offline, which 
> we try at all costs to avoid unless we have a maintenance window.
>
> Note if you only export via SMB, then you don’t have the same effect (unless 
> something has changed recently)
>
> Simon
>
> From:  on behalf of 
> "vall...@cbio.mskcc.org" 
> Reply-To: "gpfsug-discuss@spectrumscale.org" 
> 
> Date: Thursday, 3 May 2018 at 15:41
> To: "gpfsug-discuss@spectrumscale.org" 
> Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
>
> Thanks Mathiaz,
> Yes i do understand the concern, that if one of the remote file systems go 
> down abruptly - the others will go down too.
>
> However, i suppose we could bring down one of the filesystems before a 
> planned downtime?
> For example, by unexporting the filesystems on NFS/SMB before the downtime?
>
> I might not want to be in a situation, where i have to bring down all the 
> remote filesystems because of planned downtime of one of the remote clusters.
>
> Regards,
> Lohit
>
> On May 3, 2018, 7:41 AM -0400, Mathias Dietz , wrote:
>
> > Hi Lohit,
> >
> > >I am thinking of using a single CES protocol cluster, with remote mounts 
> > >from 3 storage clusters.
> > Technically this should work fine (assuming all 3 clusters use the same 
> > uids/guids). However this has not been tested in our Test lab.
> >
> >
> > >One thing to watch, be careful if your CES root is on a remote fs, as if 
> > >that goes away, so do all CES exports.
> > Not only the ces root file system is a concern, the whole CES cluster will 
> > go down if any remote file systems with NFS exports is not available.
> > e.g. if remote cluster 1 is not available, the CES cluster will unmount the 
> > corresponding file system which will lead to a NFS failure on all CES nodes.
> >
> >
> > Mit freundlichen Grüßen / Kind regards
> >
> > Mathias Dietz
> >
> > Spectrum Scale Development - Release Lead Architect (4.2.x)
> > Spectrum Scale RAS Architect
> > ---
> > IBM Deutschland
> > Am Weiher 24
> > 65451 Kelsterbach
> > Phone: +49 70342744105
> > Mobile: +49-15152801035
> > E-Mail: mdi...@de.ibm.com
> > -
> > IBM Deutschland Research & Development GmbH
> > Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk 
> > WittkoppSitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht 
> > Stuttgart, HRB 243294
> >
> >
> >
> > From:        vall...@cbio.mskcc.org
> > To:        gpfsug main discussion list 
> > Date:        01/05/2018 16:34
> > Subject:        Re: [gpfsug-discuss] Spectrum Scale CES and remote file 
> > system mounts
> > Sent by:        gpfsug-discuss-boun...@spectrumscale.org
> >
> >
> >
> > Thanks Simon.
> > I will make sure i am careful about the CES root and test nfs exporting 
> > more than 2 remote file systems.
> >
> > Regards,
> > Lohit
> >
> > On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support) 
> > , wrote:
> > You have been able to do this for some time, though I think it's only just 
> > supported.
> >
> > We've been exporting remote mounts since CES was added.
> >
> > At some point we've had two storage clusters supplying data and at least 3 
> > remote file-systems exported over NFS and SMB.
> >
> > One thing to watch, be careful if your CES root is on a remote fs, as if 
> > that goes away, so do all CES exports. We do have CES root on a remote fs 
> > and it works, just be aware...
> >
> > Simon
> > 
> > From: gpfsug-discuss-boun...@spectrumscale.org 
> > [gpfsug-discuss-boun...@spectrumscale.org] on behalf of 
> > vall...@cbio.mskcc.org [vall...@cbio.mskcc.org]
> > Sent: 30 April 2018 22:11
> > To: gpfsug main discussion list
> > Subject: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
> >
> > Hello All,
> >
> > I read from the below link, that it is now possible to export remote mounts 
> > over NFS/SMB.
> >
> > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm
> >
> > I am thinking of using a single CES protocol cluster, with remote mounts 
> > from 3 storage 

Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts

2018-05-03 Thread valleru
Thanks, Bryan.
Yes, I do understand it now, with respect to multiple clusters reading the same
file and metanode flapping.
I will make sure the workload design prevents metanode flapping.

Regards,
Lohit

On May 3, 2018, 11:15 AM -0400, Bryan Banister , 
wrote:
> Hi Lohit,
>
> Please see slides 13 and 14 in the presentation that DDN gave at the GPFS UG 
> in the UK this April:  
> http://files.gpfsug.org/presentations/2018/London/2-5_GPFSUG_London_2018_VCC_DDN_Overheads.pdf
>
> Multicluster setups with shared file access have a high probability of 
> “MetaNode Flapping”
> • “MetaNode role transfer occurs when the same files from a filesystem are 
> accessed from two or more “client” clusters via a MultiCluster relationship.”
>
> Cheers,
> -Bryan
>
> From: gpfsug-discuss-boun...@spectrumscale.org 
> [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of 
> vall...@cbio.mskcc.org
> Sent: Thursday, May 03, 2018 9:46 AM
> To: gpfsug main discussion list 
> Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
>
> Note: External Email
> Thanks Brian,
> May i know, if you could explain a bit more on the metadata updates issue?
> I am not sure i exactly understand on why the metadata updates would fail 
> between filesystems/between clusters - since every remote cluster will have 
> its own metadata pool/servers.
> I suppose the metadata updates for respective remote filesystems should go to 
> respective remote clusters/metadata servers and should not depend on metadata 
> servers of other remote clusters?
> Please do correct me if i am wrong.
> As of now, our workload is to use NFS/SMB to read files and update files from 
> different remote servers. It is not for running heavy parallel read/write 
> workloads across different servers.
>
> Thanks,
> Lohit
>
> On May 3, 2018, 10:25 AM -0400, Bryan Banister , 
> wrote:
>
> > Hi Lohit,
> >
> > Just another thought, you also have to consider that metadata updates will 
> > have to fail between nodes in the CES cluster with those in other clusters 
> > because nodes in separate remote clusters do not communicate directly for 
> > metadata updates, which depends on your workload is that would be an issue.
> >
> > Cheers,
> > -Bryan
> >
> > From: gpfsug-discuss-boun...@spectrumscale.org 
> > [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Mathias Dietz
> > Sent: Thursday, May 03, 2018 6:41 AM
> > To: gpfsug main discussion list 
> > Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file system 
> > mounts
> >
> > Note: External Email
> > Hi Lohit,
> >
> > >I am thinking of using a single CES protocol cluster, with remote mounts 
> > >from 3 storage clusters.
> > Technically this should work fine (assuming all 3 clusters use the same 
> > uids/guids). However this has not been tested in our Test lab.
> >
> >
> > >One thing to watch, be careful if your CES root is on a remote fs, as if 
> > >that goes away, so do all CES exports.
> > Not only the ces root file system is a concern, the whole CES cluster will 
> > go down if any remote file systems with NFS exports is not available.
> > e.g. if remote cluster 1 is not available, the CES cluster will unmount the 
> > corresponding file system which will lead to a NFS failure on all CES nodes.
> >
> >
> > Mit freundlichen Grüßen / Kind regards
> >
> > Mathias Dietz
> >
> > Spectrum Scale Development - Release Lead Architect (4.2.x)
> > Spectrum Scale RAS Architect
> > ---
> > IBM Deutschland
> > Am Weiher 24
> > 65451 Kelsterbach
> > Phone: +49 70342744105
> > Mobile: +49-15152801035
> > E-Mail: mdi...@de.ibm.com
> > -
> > IBM Deutschland Research & Development GmbH
> > Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk 
> > WittkoppSitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht 
> > Stuttgart, HRB 243294
> >
> >
> >
> > From:        vall...@cbio.mskcc.org
> > To:        gpfsug main discussion list 
> > Date:        01/05/2018 16:34
> > Subject:        Re: [gpfsug-discuss] Spectrum Scale CES and remote file 
> > system mounts
> > Sent by:        gpfsug-discuss-boun...@spectrumscale.org
> >
> >
> >
> > Thanks Simon.
> > I will make sure i am careful about the CES root and test nfs exporting 
> > more than 2 remote file systems.
> >
> > Regards,
> > Lohit
> >
> > On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support) 
> > , wrote:
> > You have been able to do this for some time, though I think it's only just 
> > supported.
> >
> > We've been exporting remote mounts since CES was added.
> >
> > At some point we've had two storage clusters supplying data and at least 3 
> > remote file-systems 

Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts

2018-05-03 Thread valleru
Thanks, Bryan.
Could you explain a bit more about the metadata updates issue?
I am not sure I exactly understand why the metadata updates would fail
between filesystems/between clusters, since every remote cluster will have its
own metadata pool/servers.
I suppose the metadata updates for the respective remote filesystems should go to
the respective remote clusters/metadata servers and should not depend on the
metadata servers of other remote clusters?
Please do correct me if I am wrong.
As of now, our workload is to use NFS/SMB to read files and update files from 
different remote servers. It is not for running heavy parallel read/write 
workloads across different servers.

Thanks,
Lohit

On May 3, 2018, 10:25 AM -0400, Bryan Banister , 
wrote:
> Hi Lohit,
>
> Just another thought, you also have to consider that metadata updates will 
> have to fail between nodes in the CES cluster with those in other clusters 
> because nodes in separate remote clusters do not communicate directly for 
> metadata updates, which depends on your workload is that would be an issue.
>
> Cheers,
> -Bryan
>
> From: gpfsug-discuss-boun...@spectrumscale.org 
> [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Mathias Dietz
> Sent: Thursday, May 03, 2018 6:41 AM
> To: gpfsug main discussion list 
> Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
>
> Note: External Email
> Hi Lohit,
>
> >I am thinking of using a single CES protocol cluster, with remote mounts 
> >from 3 storage clusters.
> Technically this should work fine (assuming all 3 clusters use the same 
> uids/guids). However this has not been tested in our Test lab.
>
>
> >One thing to watch, be careful if your CES root is on a remote fs, as if 
> >that goes away, so do all CES exports.
> Not only the ces root file system is a concern, the whole CES cluster will go 
> down if any remote file systems with NFS exports is not available.
> e.g. if remote cluster 1 is not available, the CES cluster will unmount the 
> corresponding file system which will lead to a NFS failure on all CES nodes.
>
>
> Mit freundlichen Grüßen / Kind regards
>
> Mathias Dietz
>
> Spectrum Scale Development - Release Lead Architect (4.2.x)
> Spectrum Scale RAS Architect
> ---
> IBM Deutschland
> Am Weiher 24
> 65451 Kelsterbach
> Phone: +49 70342744105
> Mobile: +49-15152801035
> E-Mail: mdi...@de.ibm.com
> -
> IBM Deutschland Research & Development GmbH
> Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk 
> WittkoppSitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht 
> Stuttgart, HRB 243294
>
>
>
> From:        vall...@cbio.mskcc.org
> To:        gpfsug main discussion list 
> Date:        01/05/2018 16:34
> Subject:        Re: [gpfsug-discuss] Spectrum Scale CES and remote file 
> system mounts
> Sent by:        gpfsug-discuss-boun...@spectrumscale.org
>
>
>
> Thanks Simon.
> I will make sure i am careful about the CES root and test nfs exporting more 
> than 2 remote file systems.
>
> Regards,
> Lohit
>
> On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support) 
> , wrote:
> You have been able to do this for some time, though I think it's only just 
> supported.
>
> We've been exporting remote mounts since CES was added.
>
> At some point we've had two storage clusters supplying data and at least 3 
> remote file-systems exported over NFS and SMB.
>
> One thing to watch, be careful if your CES root is on a remote fs, as if that 
> goes away, so do all CES exports. We do have CES root on a remote fs and it 
> works, just be aware...
>
> Simon
> 
> From: gpfsug-discuss-boun...@spectrumscale.org 
> [gpfsug-discuss-boun...@spectrumscale.org] on behalf of 
> vall...@cbio.mskcc.org [vall...@cbio.mskcc.org]
> Sent: 30 April 2018 22:11
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
>
> Hello All,
>
> I read from the below link, that it is now possible to export remote mounts 
> over NFS/SMB.
>
> https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm
>
> I am thinking of using a single CES protocol cluster, with remote mounts from 
> 3 storage clusters.
> May i know, if i will be able to export the 3 remote mounts(from 3 storage 
> clusters) over NFS/SMB from a single CES protocol cluster?
>
> Because according to the limitations as mentioned in the below link:
>
> https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_limitationofprotocolonRMT.htm
>
> It says “You can configure one storage cluster and up to five protocol 
> 

Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts

2018-05-03 Thread valleru
Thanks, Mathias.
Yes, I do understand the concern that if one of the remote file systems goes down
abruptly, the others will go down too.

However, I suppose we could bring down one of the filesystems before a planned
downtime?
For example, by unexporting that filesystem over NFS/SMB before the downtime,
along the lines of the sketch below?
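
Roughly - untested, and the filesystem name fs2 and the export names are
placeholders:

# Find the exports that are backed by the filesystem that is going down:
/usr/lpp/mmfs/bin/mmnfs export list
/usr/lpp/mmfs/bin/mmsmb export list

# Remove just those exports ahead of the remote cluster's downtime:
/usr/lpp/mmfs/bin/mmnfs export remove /gpfs/fs2/projects
/usr/lpp/mmfs/bin/mmsmb export remove projects

# Then unmount the remote filesystem cleanly on the CES nodes:
/usr/lpp/mmfs/bin/mmumount fs2 -a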

I would rather not be in a situation where I have to bring down all the
remote filesystems because of a planned downtime of one of the remote clusters.

Regards,
Lohit

On May 3, 2018, 7:41 AM -0400, Mathias Dietz , wrote:
> Hi Lohit,
>
> >I am thinking of using a single CES protocol cluster, with remote mounts 
> >from 3 storage clusters.
> Technically this should work fine (assuming all 3 clusters use the same 
> uids/guids). However this has not been tested in our Test lab.
>
>
> >One thing to watch, be careful if your CES root is on a remote fs, as if 
> >that goes away, so do all CES exports.
> Not only the ces root file system is a concern, the whole CES cluster will go 
> down if any remote file systems with NFS exports is not available.
> e.g. if remote cluster 1 is not available, the CES cluster will unmount the 
> corresponding file system which will lead to a NFS failure on all CES nodes.
>
>
> Mit freundlichen Grüßen / Kind regards
>
> Mathias Dietz
>
> Spectrum Scale Development - Release Lead Architect (4.2.x)
> Spectrum Scale RAS Architect
> ---
> IBM Deutschland
> Am Weiher 24
> 65451 Kelsterbach
> Phone: +49 70342744105
> Mobile: +49-15152801035
> E-Mail: mdi...@de.ibm.com
> -
> IBM Deutschland Research & Development GmbH
> Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk 
> WittkoppSitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht 
> Stuttgart, HRB 243294
>
>
>
> From:        vall...@cbio.mskcc.org
> To:        gpfsug main discussion list 
> Date:        01/05/2018 16:34
> Subject:        Re: [gpfsug-discuss] Spectrum Scale CES and remote file 
> system mounts
> Sent by:        gpfsug-discuss-boun...@spectrumscale.org
>
>
>
> Thanks Simon.
> I will make sure i am careful about the CES root and test nfs exporting more 
> than 2 remote file systems.
>
> Regards,
> Lohit
>
> On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support) 
> , wrote:
> You have been able to do this for some time, though I think it's only just 
> supported.
>
> We've been exporting remote mounts since CES was added.
>
> At some point we've had two storage clusters supplying data and at least 3 
> remote file-systems exported over NFS and SMB.
>
> One thing to watch, be careful if your CES root is on a remote fs, as if that 
> goes away, so do all CES exports. We do have CES root on a remote fs and it 
> works, just be aware...
>
> Simon
> 
> From: gpfsug-discuss-boun...@spectrumscale.org 
> [gpfsug-discuss-boun...@spectrumscale.org] on behalf of 
> vall...@cbio.mskcc.org [vall...@cbio.mskcc.org]
> Sent: 30 April 2018 22:11
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
>
> Hello All,
>
> I read from the below link, that it is now possible to export remote mounts 
> over NFS/SMB.
>
> https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm
>
> I am thinking of using a single CES protocol cluster, with remote mounts from 
> 3 storage clusters.
> May i know, if i will be able to export the 3 remote mounts(from 3 storage 
> clusters) over NFS/SMB from a single CES protocol cluster?
>
> Because according to the limitations as mentioned in the below link:
>
> https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_limitationofprotocolonRMT.htm
>
> It says “You can configure one storage cluster and up to five protocol 
> clusters (current limit).”
>
>
> Regards,
> Lohit
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts

2018-05-01 Thread valleru
Thanks Simon.
I will make sure I am careful about the CES root, and will test NFS exporting
more than 2 remote file systems.

Regards,
Lohit

On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support) 
, wrote:
> You have been able to do this for some time, though I think it's only just 
> supported.
>
> We've been exporting remote mounts since CES was added.
>
> At some point we've had two storage clusters supplying data and at least 3 
> remote file-systems exported over NFS and SMB.
>
> One thing to watch, be careful if your CES root is on a remote fs, as if that 
> goes away, so do all CES exports. We do have CES root on a remote fs and it 
> works, just be aware...
>
> Simon
> 
> From: gpfsug-discuss-boun...@spectrumscale.org 
> [gpfsug-discuss-boun...@spectrumscale.org] on behalf of 
> vall...@cbio.mskcc.org [vall...@cbio.mskcc.org]
> Sent: 30 April 2018 22:11
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
>
> Hello All,
>
> I read from the below link, that it is now possible to export remote mounts 
> over NFS/SMB.
>
> https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm
>
> I am thinking of using a single CES protocol cluster, with remote mounts from 
> 3 storage clusters.
> May i know, if i will be able to export the 3 remote mounts(from 3 storage 
> clusters) over NFS/SMB from a single CES protocol cluster?
>
> Because according to the limitations as mentioned in the below link:
>
> https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_limitationofprotocolonRMT.htm
>
> It says “You can configure one storage cluster and up to five protocol 
> clusters (current limit).”
>
>
> Regards,
> Lohit
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Spectrum Scale CES and remote file system mounts

2018-04-30 Thread valleru
Hello All,

I read from the link below that it is now possible to export remote mounts
over NFS/SMB.

https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm

I am thinking of using a single CES protocol cluster, with remote mounts from 3
storage clusters.
May I know if I will be able to export the 3 remote mounts (from 3 storage
clusters) over NFS/SMB from a single CES protocol cluster?

Because, according to the limitations mentioned in the link below:

https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_limitationofprotocolonRMT.htm

It says “You can configure one storage cluster and up to five protocol clusters 
(current limit).”


Regards,
Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Singularity + GPFS

2018-04-26 Thread valleru
We do run Singularity + GPFS on our production HPC clusters.
Most of the time things are fine, without any issues.

However, I do see a significant performance loss when running some applications
in Singularity containers on GPFS.

As of now, the applications that have severe performance issues with
Singularity on GPFS seem to be affected because of mmap I/O (deep learning
applications).
When I run the same applications on bare metal, there is a huge difference in
GPFS I/O compared to running in Singularity containers.
I am yet to raise a PMR about this with IBM.
I have not seen performance degradation for any other kind of I/O, but I am not
sure.

Regards,
Lohit

On Apr 26, 2018, 10:35 AM -0400, Nathan Harper , 
wrote:
> We are running on a test system at the moment, and haven't run into any 
> issues yet, but so far it's only been 'hello world' and running FIO.
>
> I'm interested to hear about experience with MPI-IO within Singularity.
>
> > On 26 April 2018 at 15:20, Oesterlin, Robert  
> > wrote:
> > > Anyone (including IBM) doing any work in this area? I would appreciate 
> > > hearing from you.
> > >
> > > Bob Oesterlin
> > > Sr Principal Storage Engineer, Nuance
> > >
> > >
> > > ___
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > >
>
>
>
> --
> Nathan Harper // IT Systems Lead
>
>
> e: nathan.har...@cfms.org.uk   t: 0117 906 1104  m:  0787 551 0891  w: 
> www.cfms.org.uk
> CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent // 
> Emersons Green // Bristol // BS16 7FR
>
> CFMS Services Ltd is registered in England and Wales No 05742022 - a 
> subsidiary of CFMS Ltd
> CFMS Services Ltd registered office // 43 Queens Square // Bristol // BS1 4QP
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] GPFS, MMAP and Pagepool

2018-04-11 Thread Lohit Valleru
Hey Sven,

This is regarding mmap issues and GPFS.
We had discussed previously of experimenting with GPFS 5.

I have now upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2.

I am yet to experiment with mmap performance, but before that, I am seeing
weird hangs with GPFS 5 and I think they could be related to mmap.

Have you ever seen GPFS hang on this call?
[Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]

I see the above when the kernel hangs and throws out a series of trace calls.

I somehow think the above trace is related to processes hanging on GPFS
forever. There are no errors in GPFS, however.

Also, I think the above happens only when the number of mmap threads exceeds a
particular value.

We faced a similar issue in 4.2.3, and it was resolved in a patch to 4.2.3.2.
At that time, the issue happened when the number of mmap threads went above
worker1threads. According to the ticket, it was an mmap race condition that
GPFS was not handling well.

I am not sure if this issue is a repeat, and I am yet to isolate the incident
and test with an increasing number of mmap threads.

I am not 100 percent sure if this is related to mmap yet, but I just wanted to
ask you if you have seen anything like the above.
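
To isolate it, a minimal way to drive a configurable number of mmap reader
threads against a single file is sketched below (illustrative Python, not the
actual workload; the file path and thread count are just command-line
arguments):

#!/usr/bin/env python3
# Minimal sketch: stress mmap with a configurable number of reader threads,
# e.g. to check whether the hang only appears above a certain thread count.
# Usage (hypothetical): ./mmap_stress.py /gpfs/fs1/somefile 64
import mmap
import sys
import threading

def reader(path, chunk=1024 * 1024):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
            # walk the whole mapping in chunk-sized strides
            for off in range(0, len(m), chunk):
                _ = m[off:off + chunk]

if __name__ == "__main__":
    path, nthreads = sys.argv[1], int(sys.argv[2])
    threads = [threading.Thread(target=reader, args=(path,)) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()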

Thanks,

Lohit

On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote:
> Hi Lohit,
>
> i am working with ray on a mmap performance improvement right now, which most 
> likely has the same root cause as yours , see -->  
> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> the thread above is silent after a couple of back and forth, but ray and i 
> have active communication in the background and will repost as soon as there 
> is something new to share.
> i am happy to look at this issue after we finish with ray's workload if there 
> is something missing, but first let's finish his, get you to try the same fix 
> and see if there is something missing.
>
> btw. if people would share their use of MMAP , what applications they use 
> (home grown, just use lmdb which uses mmap under the cover, etc) please let 
> me know so i get a better picture on how wide the usage is with GPFS. i know 
> a lot of the ML/DL workloads are using it, but i would like to know what else 
> is out there i might not think about. feel free to drop me a personal note, i 
> might not reply to it right away, but eventually.
>
> thx. sven
>
>
> > On Thu, Feb 22, 2018 at 12:33 PM  wrote:
> > > Hi all,
> > >
> > > I wanted to know, how does mmap interact with GPFS pagepool with respect 
> > > to filesystem block-size?
> > > Does the efficiency depend on the mmap read size and the block-size of 
> > > the filesystem even if all the data is cached in pagepool?
> > >
> > > GPFS 4.2.3.2 and CentOS7.
> > >
> > > Here is what i observed:
> > >
> > > I was testing a user script that uses mmap to read from 100M to 500MB 
> > > files.
> > >
> > > The above files are stored on 3 different filesystems.
> > >
> > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > >
> > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the 
> > > required files fully cached" from the above GPFS cluster as home. Data 
> > > and Metadata together on SSDs
> > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > >
> > > When i run the script first time for “each" filesystem:
> > > I see that GPFS reads from the files, and caches into the pagepool as it 
> > > reads, from mmdiag -- iohist
> > >
> > > When i run the second time, i see that there are no IO requests from the 
> > > compute node to GPFS NSD servers, which is expected since all the data 
> > > from the 3 filesystems is cached.
> > >
> > > However - the time taken for the script to run for the files in the 3 
> > > different filesystems is different - although i know that they are just 
> > > "mmapping"/reading from pagepool/cache and not from disk.
> > >
> > > Here is the difference in time, for IO just from pagepool:
> > >
> > > 20s 4M block size
> > > 15s 1M block size
> > > 40S 16M block size.
> > >
> > > Why do i see a difference when trying to mmap reads from different 
> > > block-size filesystems, although i see that the IO requests are not 
> > > hitting disks and just the pagepool?
> > >
> > > I am willing to share the strace output and mmdiag outputs if needed.
> > >
> > > Thanks,
> > > Lohit
> > >
> > > ___
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

Re: [gpfsug-discuss] sublocks per block in GPFS 5.0

2018-03-30 Thread valleru
Thanks Marc,

I did not know we could explicitly specify the sub-block size when creating a
file system. It is nowhere mentioned in “man mmcrfs”.
Is this a new GPFS 5.0 feature?

Also, I see from “man mmcrfs” that the default sub-block size for 8M and 16M
block sizes is 16K.

+-------------------------------+-------------------------------+
| Block size                    | Subblock size                 |
+-------------------------------+-------------------------------+
| 64 KiB                        | 2 KiB                         |
+-------------------------------+-------------------------------+
| 128 KiB                       | 4 KiB                         |
+-------------------------------+-------------------------------+
| 256 KiB, 512 KiB, 1 MiB, 2    | 8 KiB                         |
| MiB, 4 MiB                    |                               |
+-------------------------------+-------------------------------+
| 8 MiB, 16 MiB                 | 16 KiB                        |
+-------------------------------+-------------------------------+

So it is possible to create more than 1024 sub-blocks per block, and a 4K
sub-block size can be used with a 16M block size?
That would be great, since 4K files would go into the data pool, and anything
smaller than 4K would go into the system (metadata) pool?
Do you think there would be any performance degradation from reducing the
sub-block size to 4K-8K, from the default 16K, for a 16M filesystem?
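
For what it is worth, the relationship between those numbers is simply
sub-block size = block size / sub-blocks per full block. A quick illustrative
check (plain Python; the 1024 value is what mmlsfs --subblocks-per-full-block
reports for this 16M filesystem, as quoted in the original question below):

MiB, KiB = 1024 * 1024, 1024

block_size = 16 * MiB
subblocks_per_full_block = 1024      # from: mmlsfs <fs> --subblocks-per-full-block

print(block_size // subblocks_per_full_block // KiB)   # -> 16 (KiB), matching the table above

# A 4 KiB sub-block at a 16 MiB block size would instead require:
print(block_size // (4 * KiB))                          # -> 4096 sub-blocks per full block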

If we are not losing any space by choosing a bigger block size (16M) for the
filesystem, why would we want to choose a smaller block size (4M)?
What advantage would a smaller block size (4M) give compared to 16M,
performance-wise, since a 16M filesystem can store and read small files at
their respective sizes too? And near-line rotating disks would be happier with
a bigger block size than a smaller one, I guess?

Regards,
Lohit

On Mar 30, 2018, 12:45 PM -0400, Marc A Kaplan , wrote:
>
> subblock
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] sublocks per block in GPFS 5.0

2018-03-30 Thread valleru
Hello Everyone,

I am a little confused about the number of sub-blocks per block for a 16M block
size in GPFS 5.0.

The documentation below mentions that the number of sub-blocks per block is
16K, but "only for Spectrum Scale RAID":

https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/

However, when I created the filesystem “without” Spectrum Scale RAID, I still
see that the number of sub-blocks per block is 1024:

mmlsfs  --subblocks-per-full-block
flag                value                    description
---  ---
 --subblocks-per-full-block 1024             Number of subblocks per full block

So may I know whether the number of sub-blocks per block is really 16K, or am I
missing something?

Regards,
Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage

2018-02-22 Thread valleru
Thanks. I will try the file heat feature, but I am really not sure if it would
work, since the code can access cold files too, and not necessarily recently
accessed/hot files.

With respect to LROC, let me explain the use case:

As a first step, the code reads headers (a small region of data) from thousands
of files - for example, about 30,000 of them, each about 300MB to 500MB in
size.
After the first step, with the help of those headers, it mmaps/seeks across
various regions of a set of files in parallel.
Since it is all small I/Os, and reading from GPFS over the network directly
from disk was really slow, our idea was to use AFM, which I believe fetches all
file data into flash/SSDs once the initial few blocks of the files are read.
But again, AFM does not seem to solve the problem, so I want to know if LROC
behaves the same way as AFM, where all of the file data is prefetched at full
block size utilizing all the worker threads, once a few blocks of the file are
read initially.
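
For concreteness, the access pattern described above looks roughly like the
sketch below (illustrative Python, not the actual user code; the header size,
region size and region count are made-up assumptions, and the files are passed
on the command line):

#!/usr/bin/env python3
# Illustrative sketch of the access pattern described above: step 1 reads a
# small header from every file; step 2 mmaps each file and touches scattered
# small regions, both using a pool of worker threads.
import mmap
import random
import sys
from concurrent.futures import ThreadPoolExecutor

HEADER_BYTES = 64 * 1024        # assumed header size
REGION_BYTES = 4 * 1024         # assumed size of each scattered read
REGIONS_PER_FILE = 16           # assumed number of regions touched per file

def read_header(path):
    with open(path, "rb") as f:
        return path, f.read(HEADER_BYTES)

def touch_regions(path):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
            total = 0
            for _ in range(REGIONS_PER_FILE):
                off = random.randrange(0, max(1, len(m) - REGION_BYTES))
                total += sum(m[off:off + REGION_BYTES])   # force the pages in
            return total

if __name__ == "__main__":
    files = sys.argv[1:]        # e.g. the ~30,000 input files
    with ThreadPoolExecutor(max_workers=32) as pool:
        headers = dict(pool.map(read_header, files))      # step 1: headers only
        list(pool.map(touch_regions, files))               # step 2: scattered mmap reads

The question, in other words, is whether a read pattern like step 1 is enough
to make LROC (or AFM) pull the rest of each file into the flash tier before
step 2 runs.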

Thanks,
Lohit

On Feb 22, 2018, 4:52 PM -0500, IBM Spectrum Scale , wrote:
> My apologies for not being more clear on the flash storage pool.  I meant 
> that this would be just another GPFS storage pool in the same cluster, so no 
> separate AFM cache cluster.  You would then use the file heat feature to 
> ensure more frequently accessed files are migrated to that all flash storage 
> pool.
>
> As for LROC could you please clarify what you mean by a few headers/stubs of 
> the file?  In reading the LROC documentation and the LROC variables available 
> in the mmchconfig command I think you might want to take a look at the 
> lrocDataStubFileSize variable since it seems to apply to your situation.
>
> Regards, The Spectrum Scale (GPFS) team
>
> --
> If you feel that your question can benefit other users of  Spectrum Scale 
> (GPFS), then please post it to the public IBM developerWorks Forum at 
> https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479.
>
> If your query concerns a potential software error in Spectrum Scale (GPFS) 
> and you have an IBM software maintenance contract please contact  
> 1-800-237-5511 in the United States or your local IBM Service Center in other 
> countries.
>
> The forum is informally monitored as time permits and should not be used for 
> priority messages to the Spectrum Scale (GPFS) team.
>
>
>
> From:        vall...@cbio.mskcc.org
> To:        gpfsug main discussion list 
> Cc:        gpfsug-discuss-boun...@spectrumscale.org
> Date:        02/22/2018 04:21 PM
> Subject:        Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
> Sent by:        gpfsug-discuss-boun...@spectrumscale.org
>
>
>
> Thank you.
>
> I am sorry if i was not clear, but the metadata pool is all on SSDs in the 
> GPFS clusters that we use. Its just the data pool that is on Near-Line 
> Rotating disks.
> I understand that AFM might not be able to solve the issue, and I will try 
> and see if file heat works for migrating the files to flash tier.
> You mentioned an all flash storage pool for heavily used files - so you mean 
> a different GPFS cluster just with flash storage, and to manually copy the 
> files to flash storage whenever needed?
> The IO performance that i am talking is prominently for reads, so you mention 
> that LROC can work in the way i want it to? that is prefetch all the files 
> into LROC cache, after only few headers/stubs of data are read from those 
> files?
> I thought LROC only keeps that block of data that is prefetched from the 
> disk, and will not prefetch the whole file if a stub of data is read.
> Please do let me know, if i understood it wrong.
>
> On Feb 22, 2018, 4:08 PM -0500, IBM Spectrum Scale , wrote:
> I do not think AFM is intended to solve the problem you are trying to solve.  
> If I understand your scenario correctly you state that you are placing 
> metadata on NL-SAS storage.  If that is true that would not be wise 
> especially if you are going to do many metadata operations.  I suspect your 
> performance issues are partially due to the fact that metadata is being 
> stored on NL-SAS storage.  You stated that you did not think the file heat 
> feature would do what you intended but have you tried to use it to see if it 
> could solve your problem?  I would think having metadata on SSD/flash storage 
> combined with a all flash storage pool for your heavily used files would 
> perform well.  If you expect IO usage will be such that there will be far 
> more reads than writes then LROC should be beneficial to your overall 
> performance.
>
> Regards, The Spectrum Scale (GPFS) team
>
> 

Re: [gpfsug-discuss] GPFS, MMAP and Pagepool

2018-02-22 Thread valleru
Thanks a lot, Sven.
I was trying out all the scenarios that Ray mentioned with respect to LROC and
an all-flash GPFS cluster, and nothing seemed to be effective.

As of now, we are deploying a new test cluster on GPFS 5.0, and it would be
good to know the respective features that could be enabled, to see if they
improve anything.

On the other side, I have seen various cases in my past 6 years with GPFS where
different tools frequently use mmap. This dates back to 2013 -
http://www.spectrumscale.org/pipermail/gpfsug-discuss/2013-May/000253.html -
when one of my colleagues asked the same question. At that time, it was a
homegrown application that was using mmap, along with a few other genomics
pipelines.
A year ago, we had an issue with mmap and a large number of threads where GPFS
would just hang without any traces or logs; that was fixed recently. It was
related to RELION:
https://sbgrid.org/software/titles/relion

The issue that we are seeing now is with ML/DL workloads, and is related to
external tools such as OpenSlide (http://openslide.org/) and PyTorch
(http://pytorch.org/), with the field of application being deep learning on
thousands of image patches.

The I/O is really slow when accessed from hard disk, and thus I was trying out
other options such as LROC and a flash cluster/AFM cluster. But everything has
a limitation, as Ray mentioned.

Thanks,
Lohit

On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote:
> Hi Lohit,
>
> i am working with ray on a mmap performance improvement right now, which most 
> likely has the same root cause as yours , see -->  
> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> the thread above is silent after a couple of back and forth, but ray and i 
> have active communication in the background and will repost as soon as there 
> is something new to share.
> i am happy to look at this issue after we finish with ray's workload if there 
> is something missing, but first let's finish his, get you to try the same fix 
> and see if there is something missing.
>
> btw. if people would share their use of MMAP , what applications they use 
> (home grown, just use lmdb which uses mmap under the cover, etc) please let 
> me know so i get a better picture on how wide the usage is with GPFS. i know 
> a lot of the ML/DL workloads are using it, but i would like to know what else 
> is out there i might not think about. feel free to drop me a personal note, i 
> might not reply to it right away, but eventually.
>
> thx. sven
>
>
> > On Thu, Feb 22, 2018 at 12:33 PM  wrote:
> > > Hi all,
> > >
> > > I wanted to know, how does mmap interact with GPFS pagepool with respect 
> > > to filesystem block-size?
> > > Does the efficiency depend on the mmap read size and the block-size of 
> > > the filesystem even if all the data is cached in pagepool?
> > >
> > > GPFS 4.2.3.2 and CentOS7.
> > >
> > > Here is what i observed:
> > >
> > > I was testing a user script that uses mmap to read from 100M to 500MB 
> > > files.
> > >
> > > The above files are stored on 3 different filesystems.
> > >
> > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > >
> > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the 
> > > required files fully cached" from the above GPFS cluster as home. Data 
> > > and Metadata together on SSDs
> > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > >
> > > When i run the script first time for “each" filesystem:
> > > I see that GPFS reads from the files, and caches into the pagepool as it 
> > > reads, from mmdiag -- iohist
> > >
> > > When i run the second time, i see that there are no IO requests from the 
> > > compute node to GPFS NSD servers, which is expected since all the data 
> > > from the 3 filesystems is cached.
> > >
> > > However - the time taken for the script to run for the files in the 3 
> > > different filesystems is different - although i know that they are just 
> > > "mmapping"/reading from pagepool/cache and not from disk.
> > >
> > > Here is the difference in time, for IO just from pagepool:
> > >
> > > 20s 4M block size
> > > 15s 1M block size
> > > 40S 16M block size.
> > >
> > > Why do i see a difference when trying to mmap reads from different 
> > > block-size filesystems, although i see that the IO requests are not 
> > > hitting disks and just the pagepool?
> > >
> > > I am willing to share the strace output and mmdiag outputs if needed.
> > >
> > > Thanks,
> > > Lohit
> > >
> > > ___
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage

2018-02-22 Thread valleru
Thank you.

I am sorry if I was not clear, but the metadata pool is all on SSDs in the GPFS
clusters that we use. It is just the data pool that is on near-line rotating
disks.
I understand that AFM might not be able to solve the issue, and I will try and
see if file heat works for migrating the files to the flash tier.
You mentioned an all-flash storage pool for heavily used files - do you mean a
different GPFS cluster with just flash storage, and manually copying the files
to flash storage whenever needed?
The I/O performance I am talking about is predominantly for reads, so are you
saying that LROC can work the way I want it to? That is, prefetch all of the
files into the LROC cache after only a few headers/stubs of data are read from
those files?
I thought LROC only keeps the blocks of data that were fetched from disk, and
will not prefetch the whole file if only a stub of data is read.
Please do let me know if I understood it wrong.

On Feb 22, 2018, 4:08 PM -0500, IBM Spectrum Scale , wrote:
> I do not think AFM is intended to solve the problem you are trying to solve.  
> If I understand your scenario correctly you state that you are placing 
> metadata on NL-SAS storage.  If that is true that would not be wise 
> especially if you are going to do many metadata operations.  I suspect your 
> performance issues are partially due to the fact that metadata is being 
> stored on NL-SAS storage.  You stated that you did not think the file heat 
> feature would do what you intended but have you tried to use it to see if it 
> could solve your problem?  I would think having metadata on SSD/flash storage 
> combined with an all-flash storage pool for your heavily used files would 
> perform well.  If you expect IO usage will be such that there will be far 
> more reads than writes then LROC should be beneficial to your overall 
> performance.
>
> Regards, The Spectrum Scale (GPFS) team
>
> --
> If you feel that your question can benefit other users of  Spectrum Scale 
> (GPFS), then please post it to the public IBM developerWorks Forum at 
> https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479.
>
> If your query concerns a potential software error in Spectrum Scale (GPFS) 
> and you have an IBM software maintenance contract please contact  
> 1-800-237-5511 in the United States or your local IBM Service Center in other 
> countries.
>
> The forum is informally monitored as time permits and should not be used for 
> priority messages to the Spectrum Scale (GPFS) team.
>
>
>
> From:        vall...@cbio.mskcc.org
> To:        gpfsug main discussion list 
> Date:        02/22/2018 03:11 PM
> Subject:        [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
> Sent by:        gpfsug-discuss-boun...@spectrumscale.org
>
>
>
> Hi All,
>
> I am trying to figure out a GPFS tiering architecture with flash storage in 
> front end and near line storage as backend, for Supercomputing
>
> The Backend storage will be a GPFS storage on near line of about 8-10PB. The 
> backend storage will/can be tuned to give out large streaming bandwidth and 
> enough metadata disks to make the stat of all these files fast enough.
>
> I was thinking if it would be possible to use a GPFS flash cluster or GPFS 
> SSD cluster in front end that uses AFM and acts as a cache cluster with the 
> backend GPFS cluster.
>
> At the end of this .. the workflow that i am targeting is where:
>
>
> “
> If the compute nodes read headers of thousands of large files ranging from 
> 100MB to 1GB, the AFM cluster should be able to bring up enough threads to 
> bring up all of the files from the backend to the faster SSD/Flash GPFS 
> cluster.
> The working set might be about 100T, at a time which i want to be on a 
> faster/low latency tier, and the rest of the files to be in slower tier until 
> they are read by the compute nodes.
> “
>
>
> I do not want to use GPFS policies to achieve the above, is because i am not 
> sure - if policies could be written in a way, that files are moved from the 
> slower tier to faster tier depending on how the jobs interact with the files.
> I know that the policies could be written depending on the heat, and 
> size/format but i don’t think thes policies work in a similar way as above.
>
> I did try the above architecture, where an SSD GPFS cluster acts as an AFM 
> cache cluster before the near line storage. However the AFM cluster was 
> really really slow, It took it about few hours to copy the files from near 
> line storage to AFM cache cluster.
> I am not sure if AFM is not designed to work this way, or if AFM is not tuned 
> to work as fast as it should.
>
> I have tried LROC too, but it does not behave the same way as i guess AFM 
> works.
>
> Has anyone tried or know if GPFS supports an architecture - where 

[gpfsug-discuss] GPFS, MMAP and Pagepool

2018-02-22 Thread valleru
Hi all,

I wanted to know how mmap interacts with the GPFS pagepool with respect to
filesystem block size.
Does the efficiency depend on the mmap read size and the block size of the
filesystem, even if all the data is cached in the pagepool?

GPFS 4.2.3.2 and CentOS 7.

Here is what I observed:

I was testing a user script that uses mmap to read from 100MB to 500MB files.

The above files are stored on 3 different filesystems.

Compute nodes - 10G pagepool and 5G seqdiscardthreshold.

1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near 
line and metadata on SSDs
2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the required 
files fully cached" from the above GPFS cluster as home. Data and Metadata 
together on SSDs
3. 16M block size GPFS filesystem, with separate metadata and data. Data on 
Near line and metadata on SSDs

When I run the script the first time for "each" filesystem:
I see that GPFS reads from the files and caches into the pagepool as it reads,
per mmdiag --iohist.

When I run it the second time, I see that there are no I/O requests from the
compute node to the GPFS NSD servers, which is expected since all the data from
the 3 filesystems is cached.

However, the time taken for the script to run for the files on the 3 different
filesystems is different - although I know that they are just
"mmapping"/reading from pagepool/cache and not from disk.

Here is the difference in time, for I/O just from the pagepool:

20s 4M block size
15s 1M block size
40s 16M block size

Why do I see a difference when doing mmap reads from filesystems with different
block sizes, although I see that the I/O requests are not hitting the disks,
just the pagepool?

I am willing to share the strace output and mmdiag outputs if needed.
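
For reference, the test is essentially the following kind of harness (an
illustrative Python re-creation, not the actual user script; the 1MB chunk size
is an assumption). Run it over same-sized files on each of the three
filesystems, once to warm the pagepool and a second time to measure the fully
cached case:

#!/usr/bin/env python3
# Illustrative re-creation of the test described above: mmap each file and read
# it in fixed-size chunks, timing the whole pass.
import mmap
import sys
import time

CHUNK = 1024 * 1024   # assumed read size used by the user script

def mmap_read(path):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
            for off in range(0, len(m), CHUNK):
                _ = m[off:off + CHUNK]

if __name__ == "__main__":
    start = time.time()
    for path in sys.argv[1:]:
        mmap_read(path)
    print(f"{time.time() - start:.1f}s for {len(sys.argv) - 1} files")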

Thanks,
Lohit

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage

2018-02-22 Thread valleru
Hi All,

I am trying to figure out a GPFS tiering architecture with flash storage on the
front end and near-line storage as the backend, for supercomputing.

The backend storage will be GPFS on near-line disks of about 8-10PB. The
backend storage will/can be tuned to give out large streaming bandwidth, with
enough metadata disks to make stat of all these files fast enough.

I was wondering if it would be possible to use a GPFS flash or SSD cluster on
the front end that uses AFM and acts as a cache cluster in front of the backend
GPFS cluster.

At the end of this, the workflow I am targeting is the following:


“
If the compute nodes read headers of thousands of large files ranging from
100MB to 1GB, the AFM cluster should be able to bring up enough threads to pull
all of those files from the backend to the faster SSD/flash GPFS cluster.
The working set might be about 100T at a time, which I want to be on a
faster/low-latency tier, with the rest of the files on the slower tier until
they are read by the compute nodes.
”


The reason I do not want to use GPFS policies to achieve the above is that I am
not sure whether policies can be written in a way that files are moved from the
slower tier to the faster tier depending on how the jobs interact with the
files.
I know that policies can be written based on heat and size/format, but I don’t
think those policies work in a way similar to the above.

I did try the above architecture, where an SSD GPFS cluster acts as an AFM
cache cluster in front of the near-line storage. However, the AFM cluster was
really, really slow; it took about a few hours to copy the files from near-line
storage to the AFM cache cluster.
I am not sure if AFM is not designed to work this way, or if AFM is not tuned
to work as fast as it should.

I have tried LROC too, but it does not behave the same way as I guess AFM works.

Has anyone tried, or does anyone know, whether GPFS supports an architecture
where the fast tier can bring up thousands of threads and copy files almost
instantly/asynchronously from the slow tier whenever the jobs on the compute
nodes read a few blocks from these files?
I understand that, with respect to hardware, the AFM cluster should be really
fast, as should the network between the AFM cluster and the backend cluster.

Please do also let me know if the above workflow can be achieved using GPFS
policies, and whether it can be as fast as it needs to be.
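
In the absence of such a feature, the closest user-space approximation would be
to warm the fast tier explicitly once the header pass has identified the
working set. A hedged sketch of that idea (illustrative Python; the
flash-backed destination path and worker count are assumptions, not a tested
design):

#!/usr/bin/env python3
# Hedged sketch: after the header pass has identified the working set, warm the
# fast tier by copying those files there with many concurrent workers, so that
# the later small/mmap reads hit flash instead of near-line disk.
import shutil
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

FLASH_DIR = Path("/gpfs/flash/cache")    # hypothetical flash-backed path/fileset
WORKERS = 256                            # tune to what the fast tier can sustain

def warm(path):
    src = Path(path)
    dst = FLASH_DIR / src.name
    if not dst.exists():
        shutil.copy2(src, dst)           # full-file copy == one sequential read of src
    return dst

if __name__ == "__main__":
    working_set = [line.strip() for line in sys.stdin if line.strip()]
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(warm, working_set))  # jobs would then be pointed at FLASH_DIR

Whether something like this can ever be as fast as a native prefetch inside
GPFS/AFM is exactly the open question above.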

Regards,
Lohit


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss