Re: [gpfsug-discuss] Change uidNumber and gidNumber for billions of files
Thank you everyone for the inputs. The answers to some of the questions are as follows:

> From Jez: I've done this a few times in the past in a previous life. In many respects it is easier (and faster!) to remap the AD side to the uids already on the filesystem.

Yes, we had considered/attempted this, and it does work pretty well. It is actually much faster than using SSSD auto id mapping. But the main issue with this approach was automating the entry of uidNumbers and gidNumbers for all the enterprise users/groups across the agency. Both approaches have their pros and cons. For now, we wanted to see the amount of effort that would be needed to change the uidNumbers and gidNumbers on the filesystem side, in case the other option of entering the existing uidNumber/gidNumber data into AD does not work out.

> Does the filesystem have ACLs? And which ACLs?

Since we have CES servers that export the filesystems over the SMB protocol, the filesystems use NFS4 ACL mode. As far as we know, only one fileset is extensively using NFS4 ACLs.

> Can we take a downtime to do this change?

For the current GPFS storage clusters which are in production, we are thinking of taking a downtime to do this per cluster. For new clusters/storage clusters, we are thinking of changing to AD before any new data is written to the storage.

> Do the uidNumbers/gidNumbers conflict?

No. The current uidNumbers and gidNumbers are in the 1000 - 8000 range, while the new uidNumbers/gidNumbers are above 100.

I was thinking of taking a backup of the current state of the filesystem with respect to posix permissions/owner/group and the respective quotas, then disabling quotas with a downtime before making changes. I might start small with a single lab, and only change files without ACLs. May I know if anyone has a method/tool to find out which files/dirs have NFS4 ACLs set?
As far as we know, it is just one fileset/lab, but it would be good to confirm whether we have them set on any other files/dirs in the filesystem. The usual methods do not seem to work.

Jonathan/Aaron, thank you for the inputs regarding the scripts/APIs/symlinks and ACLs. I will try to see what I can do given the current state. I too wish the GPFS API could be better at managing these kinds of scenarios, but I understand that such huge changes are probably pretty rare.

Thank you,
Lohit

On June 10, 2020 at 6:33:45 AM, Jonathan Buzzard (jonathan.buzz...@strath.ac.uk) wrote:

On 10/06/2020 02:15, Aaron Knister wrote:
> Lohit,
>
> I did this while working @ NASA. I had two tools I used, one affectionately known as "luke file walker" (to modify traditional unix permissions) and the other known as the "milleniumfacl" (to modify posix ACLs). Stupid jokes aside, there were some real technical challenges here.
>
> I don't know if anyone from the NCCS team at NASA is on the list, but if they are perhaps they'll jump in if they're willing to share the code :)
>
> From what I recall, I used uthash and the GPFS APIs to store in memory a hash of inodes and their uid/gid information. I then walked the filesystem using the GPFS APIs and could look up the given inode in the in-memory hash to view its ownership details. Both the inode traversal and directory walk were parallelized/threaded. The way I actually executed the chown was particularly security-minded. There is a race condition that exists if you chown /path/to/file. All it takes is either a malicious user or someone monkeying around with the filesystem while it's live to accidentally chown the wrong file if a symbolic link ends up in the file path.

Well I would expect this needs to be done with no user access to the system. Or at the very least no user access for the bits you are currently modifying. Otherwise you are going to end up in a complete mess.
> My work around was to use openat() and fchmod (I think that was it, I played with this quite a bit to get it right) and for every path to be chown'd I would walk the hierarchy, opening each component with the O_NOFOLLOW flag to be sure I didn't accidentally stumble across a symlink in the way.

Or you could just use lchown so you change the ownership of the symbolic link rather than the file it is pointing to. You need to change the ownership of the symbolic link, not the file it is linking to; that will be picked up elsewhere in the scan. If you don't change the ownership of the symbolic link you are going to be left with a bunch of links owned by non-existent users. No race condition exists if you are doing it properly in the first place :-)

I concluded that the standard nftw system call was more suited to this than the GPFS inode scan. I could see no way to turn an inode into a path to the file, which lchown, gpfs_getacl and gpfs_putacl all use. I think the problem with the GPFS inode scan is that it is for a
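The openat()/O_NOFOLLOW walk Aaron describes can be sketched in a few lines of Python, which exposes the same dir_fd machinery. This is only an illustration of the technique, not either of the actual NASA tools; the function name, the uid/gid remap dictionaries, and the minimal error handling are all invented for the example:

```python
import os

def safe_chown(root_fd, rel_path, uid_map, gid_map):
    """Remap ownership of rel_path (relative to an already-open root_fd),
    refusing to follow symlinks anywhere in the path."""
    parts = rel_path.split(os.sep)
    opened = []
    fd = root_fd
    try:
        # Open each intermediate directory relative to the previous fd with
        # O_NOFOLLOW, so a symlink swapped into the path mid-scan raises
        # ELOOP instead of redirecting the chown -- the race described above.
        for comp in parts[:-1]:
            fd = os.open(comp, os.O_RDONLY | os.O_NOFOLLOW | os.O_DIRECTORY,
                         dir_fd=fd)
            opened.append(fd)
        name = parts[-1]
        st = os.stat(name, dir_fd=fd, follow_symlinks=False)
        new_uid = uid_map.get(st.st_uid, st.st_uid)
        new_gid = gid_map.get(st.st_gid, st.st_gid)
        if (new_uid, new_gid) != (st.st_uid, st.st_gid):
            # follow_symlinks=False gives lchown semantics: a symlink's own
            # ownership is changed, never its target's -- the target is
            # handled by its own record in the scan, as Jonathan notes.
            os.chown(name, new_uid, new_gid, dir_fd=fd,
                     follow_symlinks=False)
        return new_uid, new_gid
    finally:
        for f in opened:
            os.close(f)
```

Running this as root over a file list with a populated `{old_uid: new_uid}` map gives both properties discussed above: no symlink race, and links re-owned rather than their targets.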
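On the earlier question of finding which files/dirs have NFS4 ACLs set: one brute-force approach is to run mmgetacl against a generated path list and classify its output. The parsing below assumes the typical mmgetacl output shape, where NFSv4 entries end in `:allow` or `:deny` while POSIX-mode entries look like `user::rwxc` — verify that against your Scale release before trusting it, and expect per-file invocation to be slow, so feed it a candidate list (e.g. from a policy scan) rather than billions of paths:

```python
import subprocess

def looks_like_nfs4(acl_text):
    # Heuristic: NFSv4 ACL entries carry an allow/deny disposition
    # (e.g. "special:owner@:rwxc:allow"); POSIX-mode entries do not.
    # Output format is an assumption -- check a known file first.
    for line in acl_text.splitlines():
        line = line.strip()
        if line.endswith(":allow") or line.endswith(":deny"):
            return True
    return False

def files_with_nfs4_acls(paths):
    """Run mmgetacl per path and return those that look NFSv4."""
    hits = []
    for p in paths:
        out = subprocess.run(["mmgetacl", p],
                             capture_output=True, text=True)
        if out.returncode == 0 and looks_like_nfs4(out.stdout):
            hits.append(p)
    return hits
```

Sampling a few hundred paths per fileset with this would be enough to confirm whether ACL use really is confined to the one lab.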
[gpfsug-discuss] Change uidNumber and gidNumber for billions of files
Hello Everyone,

We are planning to migrate from LDAP to AD, and one of the best solutions was to change the uidNumber and gidNumber to what SSSD or Centrify would resolve.

May I know if anyone has come across a tool/tools that can change the uidNumbers and gidNumbers of billions of files efficiently and in a reliable manner? We could spend some time writing a custom script, but wanted to know if a tool already exists.

Please do let me know if anyone else has come across a similar situation, and the steps/tools used to resolve the same.

Regards,
Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
[gpfsug-discuss] Network switches/architecture for GPFS
Hello All,

I would like to discuss or understand which ethernet networking switches/architectures seem to work best with GPFS. We had thought about infiniband, but are not yet ready to move to infiniband because of the complexity/upgrade and debugging issues that come with it.

Current hardware: We are currently using an Arista 7328x 100G core switch for networking among the GPFS clusters and the compute nodes. It is a heterogeneous network, with servers on 10G/25G/100G, with and without LACP. For example: GPFS storage clusters have either 25G LACP, 10G LACP, or a single 100G network port. Compute nodes range from 10G to 100G. Login nodes/transfer servers etc. have 25G bonded. Most of the servers have Mellanox ConnectX-4 or ConnectX-5 adapters, but we also have a few older Intel, Broadcom and Chelsio network cards in the clusters. Most of the transceivers that we use are Mellanox, Finisar and Intel.

Issue: We upgraded to the above switch recently, and saw that it is not able to handle the network traffic because of the mismatch between the higher NSD server bandwidth and the lower compute node bandwidth. One issue that we did see was a lot of network discards on the switch side and network congestion, with slow IO performance on the affected compute nodes. Once we enabled ECN, we did see that it reduced the network congestion. We do see expels once in a while, but that is mostly related to network errors or a host not responding. We observed that bonding/LACP makes expels much trickier, so we have decided to go with no LACP until the GPFS code gets better at handling LACP - which I think they are working on.

We have heard that our current switch is a shallow-buffer switch, and that we would need a deeper-buffer Arista switch to perform better, with less congestion, lower latency and more throughput.
On the other side, Mellanox promises a better ASIC design and buffer architecture with a spine-leaf design, instead of one deep-buffer core switch, to get better performance than Arista.

Most of the applications that run on the clusters are either genomic applications on CPUs or deep learning applications on GPUs. All of our GPFS storage cluster versions are above 5.0.2, with the compute filesystems at 16M block size on nearline rotating disks, and flash storage at 512K block size.

May I ask for feedback from anyone who is using Arista or Mellanox switches on their clusters, to understand the pros and cons, stability, and performance numbers of the same?

Thank you,
Lohit
Re: [gpfsug-discuss] Maxblocksize tuning alternatives/max number of buffers
Hello Anderson,

This application requires a minimum throughput of about 10-13MB/s initially, and almost no IOPS during the first phase, where it opens all the files and reads the headers, and about 30MB/s throughput during the second phase. The issue that I face is during the second phase, where it tries to randomly read about 4K blocks from between 2 and about 10 random files. In this phase I see a big difference from the maxblocksize parameter changing the performance of the reads, with almost no throughput and maybe around 2-4K IOPS.

This issue is a follow-up to the issue I mentioned about a year ago, where I saw differences in performance "though there is practically no IO to the storage" - I mean, I see a difference in performance between different FS block sizes even if all data is cached in pagepool. Sven had replied to that thread mentioning that it could be because of a buffer locking issue.

The info requested is as below. 4 storage clusters:

Storage cluster for compute:
5.0.3-2 GPFS version
FS version: 19.01 (5.0.1.0)
Subblock size: 16384
Blocksize: 16M

Flash storage cluster for compute:
5.0.4-2 GPFS version
FS version: 18.00 (5.0.0.0)
Subblock size: 8192
Blocksize: 512K

Storage cluster for admin tools:
5.0.4-2 GPFS version
FS version: 16.00 (4.2.2.0)
Subblock size: 131072
Blocksize: 4M

Storage cluster for archival:
5.0.3-2 GPFS version
FS version: 16.00 (4.2.2.0)
Subblock size: 32K
Blocksize: 1M

The only two clusters that users do/will do compute on are the 16M filesystem and the 512K filesystem. When you ask what the throughput/IOPS and block size are - they vary a lot and have not been recorded. The 16M FS is capable of about 27GB/s sequential read across about 1.8 PB of storage. The 512K FS is capable of about 10-12GB/s sequential read across about 100T of storage.

Now as I mentioned previously, the issue that I am seeing is related to different FS block sizes on the same storage.
For example, on the flash storage cluster: a block size of 512K with a maxblocksize of 16M gives worse performance than a block size of 512K with a maxblocksize of 512K. It is the maxblocksize that is affecting the performance, on the same storage, with the same block size and everything else being equal. I am thinking the above is because of the number of buffers involved, but would like to learn if it could be anything else. I have debugged this with IBM GPFS techs and it has been found that there is no issue with the storage itself or with any of the other GPFS tuning parameters.

Now since we do know that maxblocksize is making a big difference, I would like to keep it as low as possible but still be able to mount other remote GPFS filesystems with higher block sizes. Or, since it is required to keep the maxblocksize the same across all storage, I would like to know if there are any other parameters that could effect the same change as maxblocksize.

Thank you,
Lohit

> On Feb 28, 2020, at 12:58 PM, Anderson Ferreira Nobre wrote:
>
> Hi Lohit,
>
> First, a few questions to understand your problem better:
> - What is the minimum release level of both clusters?
> - What is the version of the filesystem layout for 16MB, 1MB and 512KB?
> - What is the subblock size of each filesystem?
> - How many IOPS, what block size and what throughput are you doing on each filesystem?
> Abraços / Regards / Saludos,
>
> Anderson Nobre
> Power and Storage Consultant
> IBM Systems Hardware Client Technical Team – IBM Systems Lab Services
>
> Phone: 55-19-2132-4317
> E-mail: ano...@br.ibm.com <mailto:ano...@br.ibm.com>
>
> - Original message -
> From: "Valleru, Lohit/Information Systems"
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
> To: gpfsug-discuss@spectrumscale.org
> Cc:
> Subject: [EXTERNAL] [gpfsug-discuss] Maxblocksize tuning alternatives/max number of buffers
> Date: Fri, Feb 28, 2020 12:30
>
> Hello Everyone,
>
> I am looking for alternative tuning parameters that could do the same job as tuning the maxblocksize parameter.
>
> One of our users runs a deep learning application on GPUs, that does the following IO pattern:
>
> It needs to read random small sections about 4K in size from about 20,000 to 100,000 files of each 100M to 200M size.
>
> When performance tuning for the above application on a 16M filesystem and comparing it to various other file system block sizes - I realized that the performance degradation that I see might be related to the number of buffers.
>
> I observed that the performance varies widely depending on what maxblocksize parameter I use. For example, using a 16M maxblocksize for a 512K or a 1M block size filesystem differs widely from using a 512K or 1M maxblocksize for a 512K or a 1M block size filesystem
[gpfsug-discuss] Maxblocksize tuning alternatives/max number of buffers
Hello Everyone,

I am looking for alternative tuning parameters that could do the same job as tuning the maxblocksize parameter.

One of our users runs a deep learning application on GPUs that does the following IO pattern: it needs to read random small sections, about 4K in size, from about 20,000 to 100,000 files, each 100M to 200M in size.

When performance tuning for the above application on a 16M filesystem and comparing it to various other filesystem block sizes, I realized that the performance degradation that I see might be related to the number of buffers. I observed that the performance varies widely depending on what maxblocksize parameter I use. For example, using a 16M maxblocksize for a 512K or a 1M block size filesystem differs widely from using a 512K or 1M maxblocksize for a 512K or a 1M block size filesystem. The reason, I believe, might be related to the number of buffers that I can keep on the client side, but I am not sure if that is all that the maxblocksize affects.

We have different filesystem block sizes in our environment: 512K, 1M and 16M. We also use a storage clusters and compute clusters design. In order to mount the 16M filesystem along with the other filesystems on the compute clusters, we had to set the maxblocksize to 16M, no matter what the filesystem block size.

I see that I get maximum performance for this application from a 512K block size filesystem with a 512K maxblocksize. However, I will not be able to mount this filesystem along with the other filesystems, because I would need to change the maxblocksize to 16M in order to mount the other filesystems of 16M block size.

I am wondering if there is anything else that can do the same job as the maxblocksize parameter. I was thinking about parameters like maxBufferDescs for a 16M maxblocksize, but I believe it would need a lot more pagepool to keep the same number of buffers as would be needed for a 512K maxblocksize.
May I know if there is any other parameter that could help me the same way as maxblocksize, and the side effects of the same?

Thank you,
Lohit
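One way to picture why maxblocksize rather than the filesystem block size could dominate, assuming (as suggested above) that client-side buffers are sized by maxblocksize: with a fixed pagepool, the number of full-size buffers it can hold falls linearly as maxblocksize grows. This is only back-of-envelope arithmetic for illustration, not how the pagepool allocator actually carves memory:

```python
def full_size_buffers(pagepool_bytes, maxblocksize_bytes):
    # Upper bound on buffers if every buffer is maxblocksize bytes
    # (a simplifying assumption made for this sketch).
    return pagepool_bytes // maxblocksize_bytes

GiB = 1024 ** 3
MiB = 1024 ** 2

# Same hypothetical 16 GiB pagepool, same 512K filesystem block size:
print(full_size_buffers(16 * GiB, 16 * MiB))    # maxblocksize 16M  -> 1024
print(full_size_buffers(16 * GiB, 512 * 1024))  # maxblocksize 512K -> 32768
```

On this reading, raising maxBufferDescs can restore the buffer count, but only at the cost of proportionally more pagepool, which matches the trade-off described above.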
Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 81, Issue 43
body will point this out
> >>> soon ;-)
> >>>
> >>> sven
> >>>
> >>> On Tue, Sep 18, 2018 at 10:31 AM wrote:
> >>>
> >>>> Hello All,
> >>>>
> >>>> This is a continuation of the previous discussion that i had with Sven. However, against what i had mentioned previously - i realize that this is *not* related to mmap, and i see it when doing random freads.
> >>>>
> >>>> I see that the block size of the filesystem matters when reading from pagepool. I see a major difference in performance when comparing 1M to 16M, when doing a lot of random small freads with all of the data in pagepool.
> >>>>
> >>>> Performance for 1M is a magnitude *more* than the performance that i see for 16M.
> >>>>
> >>>> The GPFS that we have currently is:
> >>>> Version: 5.0.1-0.5
> >>>> Filesystem version: 19.01 (5.0.1.0)
> >>>> Block size: 16M
> >>>>
> >>>> I had made the filesystem block size 16M, thinking that i would get the most performance for both random/sequential reads from 16M than from the smaller block sizes. With GPFS 5.0, i made use of the 1024 sub-blocks instead of 32, and thus do not lose a lot of storage space even with 16M. I had run a few benchmarks and i did see that 16M was performing better *when hitting storage/disks* with respect to bandwidth for random/sequential on small/large reads.
> >>>>
> >>>> However, with this particular workload - where it freads a chunk of data randomly from hundreds of files - I see that the number of page faults increases with block size and actually reduces the performance. 1M performs a lot better than 16M, and maybe i will get better performance with less than 1M. It gives the best performance when reading from local disk, with a 4K block size filesystem.
> >>>>
> >>>> What i mean by performance when it comes to this workload is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data.
> >>>>
> >>>> I figure what is happening is: fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk. But the application could be using just a small part of that 16M. Thus when randomly reading (freads) a lot of data in 16M chunk sizes - it is page faulting a lot more and causing the performance to drop. I could try to make the application do read instead of freads, but i fear that could be bad too, since it might be hitting the disk with a very small block size and that is not good.
> >>>>
> >>>> With the way i see things now - I believe it could be best if the application does random reads of 4K/1M from pagepool but somehow does 16M from rotating disks.
> >>>>
> >>>> I don't see any way of doing the above other than following a different approach, where i create a filesystem with a smaller block size (1M or less than 1M), on SSDs as a tier.
> >>>>
> >>>> May i please ask for advice on whether what i am understanding/seeing is right, and on the best solution possible for the above scenario.
> >>>>
> >>>> Regards,
> >>>> Lohit
> >>>>
> >>>> On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru, wrote:
> >>>>
> >>>> Hey Sven,
> >>>>
> >>>> This is regarding mmap issues and GPFS. We had discussed previously about experimenting with GPFS 5.
> >>>>
> >>>> I now have upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2
> >>>>
> >>>> I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap.
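The read amplification described above can be reproduced in miniature with Python's BufferedReader, whose buffer_size plays the role the filesystem block size plays for fread: random 4K reads through a large buffer pull far more bytes from the underlying file than the application consumes. This models stdio-style buffering only, not GPFS itself; the file, sizes, and counting wrapper are all invented for the demonstration:

```python
import io
import random

class CountingRaw(io.FileIO):
    """FileIO that counts how many bytes the buffered layer pulls from it."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bytes_read = 0

    def readinto(self, b):
        n = super().readinto(b)
        if n:
            self.bytes_read += n
        return n

def amplification(path, buffer_size, reads, read_size, file_size, seed=0):
    """Do `reads` random reads of `read_size` bytes through a buffer of
    `buffer_size`; return how many bytes were fetched from the raw file."""
    rng = random.Random(seed)
    raw = CountingRaw(path, "r")
    buf = io.BufferedReader(raw, buffer_size=buffer_size)
    for _ in range(reads):
        buf.seek(rng.randrange(0, file_size - read_size))
        # The application only ever consumes read_size bytes per read...
        assert len(buf.read(read_size)) == read_size
    buf.close()
    return raw.bytes_read  # ...but the raw layer fetched buffer-size chunks.
```

Comparing a 1M buffer against a 4K buffer over the same random 4K reads shows the big buffer fetching orders of magnitude more data, which is the page-fault/latency pattern described for 16M versus 1M blocks.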
Re: [gpfsug-discuss] Follow-up: migrating billions of files
Thank you Marc. I was just trying to suggest another approach to this email thread. However, I believe we cannot run mmfind/mmapplypolicy with remote filesystems, and they can only be run on the owning cluster? In our clusters, all the GPFS clients are generally in their own compute clusters and mount filesystems from other storage clusters - which I thought is one of the recommended designs. The scripts in the /usr/lpp/mmfs/samples/util folder do work with remote filesystems, and thus on the compute nodes. I was also trying to find something that could be used by users and not by the superuser… but I guess none of these tools are meant to be run by a user without superuser privileges.

Regards,
Lohit

On Mar 8, 2019, 3:54 PM -0600, Marc A Kaplan, wrote:
> Lohit... Any and all of those commands and techniques should still work with newer versions of GPFS.
>
> But mmapplypolicy is the supported command for generating file lists. It uses the GPFS APIs and some parallel processing tricks.
>
> mmfind is a script that makes it easier to write GPFS "policy rules" and runs mmapplypolicy for you.
>
> mmxcp can be used with mmfind (and/or mmapplypolicy) to make it easy to run a cp (or other command) in parallel on those filelists ...
>
> --marc K of GPFS
>
> From: vall...@cbio.mskcc.org
> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> Date: 03/08/2019 10:13 AM
> Subject: Re: [gpfsug-discuss] Follow-up: migrating billions of files
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>
> I had to do this twice too. Once i had to copy a 4 PB filesystem as fast as possible when NSD disk descriptors were corrupted and shutting down GPFS would have led to me losing those files forever, and the other was a regular maintenance where i had to copy similar data in less time.
>
> In both cases, i just used the GPFS-provided util scripts in /usr/lpp/mmfs/samples/util/. These can be run only as root, i believe.
> I wish i could give them to users to use.
>
> I had used a few of those scripts, like tsreaddir, which used to be really fast at listing all the paths in the directories. It prints the full paths of all files along with their inodes etc. I had modified it to print just the full file paths.
>
> I then use these paths and group them up into different groups, which get fed as array jobs to the SGE/LSF cluster. Each array job basically uses GNU parallel, running something similar to rsync -avR. The "-R" option basically creates the directories as given. Of course this worked because i was using the fast private network to transfer between the storage systems. Also, i know that cp or tar might be better than rsync with respect to speed, but rsync was convenient and i could always start over again without checkpointing or remembering where i left off previously.
>
> Similar to what Bill mentioned in the previous email, but i used the gpfs util scripts and basic GNU parallel/rsync, with SGE/LSF to submit jobs to the cluster as superuser. It used to work pretty well.
>
> Since then - I constantly use parallel and rsync to copy large directories.
>
> Thank you,
> Lohit
>
> On Mar 8, 2019, 7:43 AM -0600, William Abbott, wrote:
> We had a similar situation and ended up using parsyncfp, which generates multiple parallel rsyncs based on file lists. If they're on the same IB fabric (as ours were) you can use that instead of ethernet, and it worked pretty well. One caveat is that you need to follow the parallel transfers with a final single rsync, so you can use --delete.
>
> Bill
>
> On 3/6/19 8:13 AM, Uwe Falke wrote:
> Hi, in that case I'd open several tar pipes in parallel, maybe using directories carefully selected, like
>
> tar -c | ssh "tar -x"
>
> I am not quite sure whether "-C /" for tar works here ("tar -C / -x"), but along these lines might be a good efficient method. target_hosts should be all nodes having the target file system mounted, and you should start those pipes on the nodes with the source file system. It is best to start with the largest directories, and use some masterscript to start the tar pipes controlled by semaphores to not overload anything.
>
> Mit freundlichen Grüßen / Kind regards
>
> Dr. Uwe Falke
>
> IT Specialist
> High Performance Computing Services / Integrated Technology Services / Data Center Services
> ---
> IBM Deutschland
> Rathausstr.
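The tsreaddir + GNU parallel + rsync workflow described above can be sketched without a batch scheduler: split the flat path list into groups and run one rsync per group concurrently. The chunking logic is the real point; the destination string, the thread count, and the use of --files-from (standing in for feeding paths through parallel) are illustrative choices:

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor

def chunk(paths, n):
    """Split a flat file list (e.g. tsreaddir output) into up to n
    round-robin groups of roughly equal size."""
    groups = [[] for _ in range(n)]
    for i, p in enumerate(paths):
        groups[i % n].append(p)
    return [g for g in groups if g]

def run_rsyncs(paths, src_root, dest, n=8):
    """One rsync per chunk. --files-from paths are relative to src_root
    and rsync recreates the directory structure under dest, like the -R
    usage described above. dest like 'host:/target' is illustrative."""
    def one(group):
        with tempfile.NamedTemporaryFile("w") as f:
            f.write("\n".join(group) + "\n")
            f.flush()
            return subprocess.call(
                ["rsync", "-aR", "--files-from=" + f.name, src_root, dest])
    with ThreadPoolExecutor(max_workers=n) as ex:
        return list(ex.map(one, chunk(paths, n)))
```

As with the parsyncfp approach, a final single rsync pass (e.g. with --delete) is still needed afterwards to true up the destination.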
Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share
Well, reading the user-defined authentication documentation again - it is basically left to sysadmins to deal with authentication, and it looks like it would not be so much of a hack to customize SMB on the CES nodes according to our needs. I will see if i can do this without much trouble.

Regards,
Lohit

On Mar 8, 2019, 10:42 AM -0600, vall...@cbio.mskcc.org, wrote:
> Thank you Simon.
>
> I do remember reading your page a few years back, when i was researching this issue. When you mentioned custom auth, i assumed it to be user-defined authentication from CES. However, it looks like i need to hack it a bit to get SMB working with AD?
>
> I did not feel comfortable hacking the SMB from the CES cluster, and thus i was trying to bring up SMB outside the CES cluster. I hack at almost everything in the cluster, but i leave GPFS and any of its configuration in the supported config, because if things break - i felt it might mess things up real bad. I wish we did not have to hack our way out of this, and that IBM supported this config out of the box.
>
> I do not understand the current requirements from CES with respect to AD or user-defined authentication, where either both SMB and NFS should be AD/LDAP authenticated, or both of them user-defined.
>
> I believe many places do use just ssh keys as authentication for linux machines, including cloud instances, while SMB obviously cannot be used with ssh-key authentication and has to be used either with LDAP or AD authentication.
>
> Did anyone try to raise this as a feature request?
>
> Even if i do figure out how to hack this thing and make sure that updating CES won't mess it up badly, i think i will have to do a few things to get the SIDs to map to UIDs, as you mentioned. We do not use passwords to authenticate to LDAP, and I do not want to be creating another set of passwords apart from the already-existing AD ones, which users authenticate with when they log in to machines.
> I was thinking of bringing up something like Redhat IDM, which could sync with AD and get all the usernames/SIDs and password hashes. I could then enter my current LDAP uids/gids in Redhat IDM. IDM will automatically create uids/gids for usernames that do not have them, i believe. This way, when SMB authenticates with Redhat IDM, users can use their current AD kerberos tickets or the same passwords, and i do not have to change the passwords. It will also automatically sync with AD and create UIDs/GIDs, and thus i don't have to manually script something to create one for every person in AD. I however need to see if i can get this to work with the institutional AD, and it might not be as smooth.
>
> So which of the below cases will IBM most probably support? :)
>
> 1. Run SMB outside the CES cluster with the above configuration.
> 2. Hack SMB inside the CES cluster
>
> Is it that running SMB outside the CES cluster with R/W has a possibility of corrupting the GPFS filesystem? We do not necessarily need HA with SMB, so apart from HA - what does IBM's SMB do that would prevent such corruption from happening?
>
> The reason i was expecting the usernames to be the same in LDAP and AD is that if they are, then SMB will do uid mapping by default, i.e. SMB will automatically map windows SIDs to ldap uids. I would not have to bring up Redhat IDM if that were the case. But unfortunately we have many users whose ldap usernames differ from their AD usernames - so i guess the practical way would be to use Redhat IDM to map windows SIDs to ldap uids.
>
> I have read about mmname2uid and mmuid2name that Andrew mentioned, but it looks like they are made to work between 2 gpfs clusters with different uids, not exactly to make SMB map windows SIDs to ldap uids.
>
> Regards,
> Lohit
>
> On Mar 8, 2019, 2:41 AM -0600, Simon Thompson, wrote:
> > Hi Lohit,
> >
> > Custom auth sounds like it would work.
> >
> > NFS uses the “system” ldap; SMB can use LDAP or AD, or you can fudge it and actually use both. We came at this very early in CES and I think some of this is better in mixed mode now, but we do something vaguely related to what you need.
> >
> > What you’d need is data in your ldap server to map windows usernames and SIDs to Unix IDs. So for example we have in our mmsmb config:
> >
> > idmap config * : backend ldap
> > idmap config * : bind_path_group ou=SidMap,dc=rds,dc=adf,dc=bham,dc=ac,dc=uk
> > idmap config * : ldap_base_dn ou=SidMap,dc=rds,dc=adf,dc=bham,dc=ac,dc=uk
> > idmap config * : ldap_server stand-alone
> > idmap config * : ldap_url ldap://localhost
> > idmap config * : ldap_user_dn uid=nslcd,ou=People,dc=rds,dc=adf,dc=bham,dc=ac,dc=uk
> > idmap config * : range 1000-999
> > idmap config * : rangesize 100
> > idmap config * : read only yes
> >
> > You then need entries in the LDAP server, it could
Re: [gpfsug-discuss] Follow-up: migrating billions of files
I had to do this twice too. Once i had to copy a 4 PB filesystem as fast as possible when NSD disk descriptors were corrupted and shutting down GPFS would have led to me losing those files forever, and the other was a regular maintenance where i had to copy similar data in less time.

In both cases, i just used the GPFS-provided util scripts in /usr/lpp/mmfs/samples/util/. These can be run only as root, i believe. I wish i could give them to users to use.

I had used a few of those scripts, like tsreaddir, which used to be really fast at listing all the paths in the directories. It prints the full paths of all files along with their inodes etc. I had modified it to print just the full file paths.

I then use these paths and group them up into different groups, which get fed as array jobs to the SGE/LSF cluster. Each array job basically uses GNU parallel, running something similar to rsync -avR. The “-R” option basically creates the directories as given. Of course this worked because i was using the fast private network to transfer between the storage systems. Also, i know that cp or tar might be better than rsync with respect to speed, but rsync was convenient and i could always start over again without checkpointing or remembering where i left off previously.

Similar to what Bill mentioned in the previous email, but i used the gpfs util scripts and basic GNU parallel/rsync, with SGE/LSF to submit jobs to the cluster as superuser. It used to work pretty well.

Since then - I constantly use parallel and rsync to copy large directories.

Thank you,
Lohit

On Mar 8, 2019, 7:43 AM -0600, William Abbott, wrote:
> We had a similar situation and ended up using parsyncfp, which generates multiple parallel rsyncs based on file lists. If they're on the same IB fabric (as ours were) you can use that instead of ethernet, and it worked pretty well. One caveat is that you need to follow the parallel transfers with a final single rsync, so you can use --delete.
>
> For the initial transfer you can also use bbcp. It can get very good performance but isn't nearly as convenient as rsync for subsequent transfers. The performance isn't good with small files but you can use tar on both ends to deal with that, in a similar way to what Uwe suggests below. The bbcp documentation outlines how to do that.
>
> Bill
>
> On 3/6/19 8:13 AM, Uwe Falke wrote:
> > Hi, in that case I'd open several tar pipes in parallel, maybe using directories carefully selected, like
> >
> > tar -c | ssh "tar -x"
> >
> > I am not quite sure whether "-C /" for tar works here ("tar -C / -x"), but along these lines might be a good efficient method. target_hosts should be all nodes having the target file system mounted, and you should start those pipes on the nodes with the source file system. It is best to start with the largest directories, and use some masterscript to start the tar pipes controlled by semaphores to not overload anything.
> >
> > Mit freundlichen Grüßen / Kind regards
> >
> > Dr. Uwe Falke
> >
> > IT Specialist
> > High Performance Computing Services / Integrated Technology Services / Data Center Services
> > ---
> > IBM Deutschland
> > Rathausstr. 7
> > 09111 Chemnitz
> > Phone: +49 371 6978 2165
> > Mobile: +49 175 575 2877
> > E-Mail: uwefa...@de.ibm.com
> > ---
> > IBM Deutschland Business & Technology Services GmbH / Geschäftsführung: Thomas Wolter, Sven Schooß
> > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122
> >
> > From: "Oesterlin, Robert"
> > To: gpfsug main discussion list
> > Date: 06/03/2019 13:44
> > Subject: [gpfsug-discuss] Follow-up: migrating billions of files
> > Sent by: gpfsug-discuss-boun...@spectrumscale.org
> >
> > Some of you had questions to my original post.
More information: > > > > Source: > > - Files are straight GPFS/Posix - no extended NFSV4 ACLs > > - A solution that requires $'s to be spent on software (ie, Aspera) isn't > > a very viable option > > - Both source and target clusters are in the same DC > > - Source is stand-alone NSD servers (bonded 10g-E) and 8gb FC SAN storage > > - Approx 40 file systems, a few large ones with 300M-400M files each, > > others smaller > > - no independent file sets > > - migration must pose minimal disruption to existing users > > > > Target architecture is a small number of file systems (2-3) on ESS with > > independent filesets > > - Target (ESS) will have multiple 40gb-E links on each NSD server (GS4) > > > > My current thinking is AFM with a pre-populate of the file space and > > switch the clients over to have them pull data they need (most of the data > > is older and
Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share
Thanks a lot, Andrew. It does look promising, but it is not immediately clear to me how this could solve the SMB export case where the user authenticates with an AD username but the GPFS files present are owned by an LDAP username. Maybe you are saying that if I enable GPFS to use these scripts, then GPFS will map the AD username to the LDAP username? I found this URL too: https://www.ibm.com/support/knowledgecenter/en/SSFKCN/com.ibm.cluster.gpfs.doc/gpfs_uid/uid_gpfs.html I will give it a read, try to understand how to implement it, and get back if I have any more questions. If this works, it should help me configure and use CES SMB. (Hopefully, CES file-based authentication will allow both SSH-key authentication for NFS and AD for SMB in the same CES cluster.) Regards, Lohit On Mar 7, 2019, 4:52 PM -0600, Andrew Beattie , wrote: > Lohit > > Have you looked at mmUIDtoName mmNametoUID > > Yes it will require some custom scripting on your behalf but it would be a > far more elegant solution and not run the risk of data corruption issues. > > There is at least one university on this mailing list that is doing exactly > what you are talking about, and they successfully use > mmUIDtoName / mmNametoUID to provide the relevant mapping between different > authentication environments - both internally in the university and > externally from other institutions. 
> > They use AFM to move data between different storage clusters, and mmUIDtoName > / mmNametoUID, to manage the ACL and permissions, they then move the data > from the AFM filesystem to the HPC scratch filesystem for processing by the > HPC (different filesystems within the same cluster) > > > Regards, > Andrew Beattie > File and Object Storage Technical Specialist - A/NZ > IBM Systems - Storage > Phone: 614-2133-7927 > E-mail: abeat...@au1.ibm.com > > > > - Original message - > > From: vall...@cbio.mskcc.org > > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > To: gpfsug-discuss@spectrumscale.org, gpfsug main discussion list > > > > Cc: > > Subject: Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB > > share > > Date: Fri, Mar 8, 2019 8:21 AM > > > > We have many current usernames from LDAP that do not exactly match with the > > usernames from AD. > > Unfortunately, i guess CES SMB will need us to use either AD or LDAP or use > > the same usernames in both AD and LDAP. > > I have been looking for a solution where could map the different usernames > > from LDAP and AD but have not found a solution. So exploring ways to do > > this from RHEL SMB. > > I would appreciate if you have any solution to this issue. > > > > As of now we use LDAP uids/gids and SSH keys for authentication to the HPC > > cluster. > > We want to use CES SMB to export the same mounts which have LDAP > > usernames/uids/gids however because of different usernames in AD - it has > > become a challenge. > > Even if we do find a solution to this, i want to be able to use AD > > authentication for SMB and ssh key authentication for NFS. > > > > The above are the reasons we are just using CES with NFS and user defined > > authentication for users to have access with login through ssh keys. 
> > > > Regards, > > Lohit > > > > On Mar 7, 2019, 3:12 PM -0600, Andrew Beattie , wrote: > > > That would not be supported > > > > > > You shouldn't publish a remote mount Protocol cluster , and then connect > > > a native client to that cluster and create a non CES protocol export > > > if you are going to use a Protocol cluster that's how you present your > > > protocols. > > > otherwise don't set up the remote mount cluster. > > > > > > Why are you trying to publish a non HA RHEL SMB share instead of using > > > the HA CES protocols? > > > Andrew Beattie > > > File and Object Storage Technical Specialist - A/NZ > > > IBM Systems - Storage > > > Phone: 614-2133-7927 > > > E-mail: abeat...@au1.ibm.com > > > > > > > > > > - Original message - > > > > From: vall...@cbio.mskcc.org > > > > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > > > To: gpfsug-discuss@spectrumscale.org, gpfsug main discussion list > > > > > > > > Cc: > > > > Subject: Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces > > > > SMB share > > > > Date: Fri, Mar 8, 2019 7:05 AM > > > > > > > > Thank you Andrew. > > > > > > > > However, we are not using SMB from the CES cluster but instead running > > > > a Redhat based SMB on a GPFS client of the CES cluster and exporting it > > > > from the GPFS client. > > > > Is the above supported, and not known to cause any issues? > > > > > > > > Regards, > > > > Lohit > > > > > > > > On Mar 7, 2019, 2:45 PM -0600, Andrew Beattie , > > > > wrote: > > > > > > > > > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_configprotocolsonremotefs.htm > > > > ___ > > > > gpfsug-discuss mailing list > > > >
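Purely as an illustration of the kind of custom mapping scripting Andrew mentions, a name-to-uidNumber lookup against a locally maintained table might look like the sketch below. The table path, its format, and the helper name are all invented; this is not the actual GPFS UID-remapping user-exit interface.

```shell
# Hypothetical AD-username -> LDAP-uidNumber lookup against a locally
# maintained tab-separated table. Names, path, and format are all
# assumptions made for illustration.
cat > /tmp/idmap.tsv <<'EOF'
jdoe	5001
asmith	5002
EOF
name2uid() {
    # print the uidNumber for a given name; fail if the name is unknown
    awk -v n="$1" '$1 == n { print $2; found = 1 } END { exit !found }' /tmp/idmap.tsv
}
name2uid jdoe    # prints 5001
```

The real user exits would wrap something like this (or an LDAP query) behind whatever calling convention GPFS expects.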
Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share
We have many current usernames in LDAP that do not exactly match the usernames in AD. Unfortunately, I guess CES SMB will need us to use either AD or LDAP, or to use the same usernames in both. I have been looking for a way to map the differing usernames between LDAP and AD but have not found a solution, so I am exploring ways to do this from RHEL SMB. I would appreciate it if you have any solution to this issue. As of now we use LDAP uids/gids and SSH keys for authentication to the HPC cluster. We want to use CES SMB to export the same mounts, which have LDAP usernames/uids/gids; however, because of the different usernames in AD, this has become a challenge. Even if we do find a solution to this, I want to be able to use AD authentication for SMB and SSH-key authentication for NFS. The above are the reasons we are currently using CES with NFS only, with user-defined authentication, and users get access by logging in through SSH keys. Regards, Lohit On Mar 7, 2019, 3:12 PM -0600, Andrew Beattie , wrote: > That would not be supported > > You shouldn't publish a remote mount Protocol cluster , and then connect a > native client to that cluster and create a non CES protocol export > if you are going to use a Protocol cluster that's how you present your > protocols. > otherwise don't set up the remote mount cluster. > > Why are you trying to publish a non HA RHEL SMB share instead of using the HA > CES protocols? > Andrew Beattie > File and Object Storage Technical Specialist - A/NZ > IBM Systems - Storage > Phone: 614-2133-7927 > E-mail: abeat...@au1.ibm.com > > > > - Original message - > > From: vall...@cbio.mskcc.org > > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > To: gpfsug-discuss@spectrumscale.org, gpfsug main discussion list > > > > Cc: > > Subject: Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB > > share > > Date: Fri, Mar 8, 2019 7:05 AM > > > > Thank you Andrew. 
> > > > However, we are not using SMB from the CES cluster but instead running a > > Redhat based SMB on a GPFS client of the CES cluster and exporting it from > > the GPFS client. > > Is the above supported, and not known to cause any issues? > > > > Regards, > > Lohit > > > > On Mar 7, 2019, 2:45 PM -0600, Andrew Beattie , wrote: > > > > > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_configprotocolsonremotefs.htm > > ___ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
Re: [gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share
Thank you, Andrew. However, we are not using SMB from the CES cluster; we are instead running a Red Hat-based SMB server on a GPFS client of the CES cluster and exporting it from that client. Is the above supported, and not known to cause any issues? Regards, Lohit On Mar 7, 2019, 2:45 PM -0600, Andrew Beattie , wrote: > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_configprotocolsonremotefs.htm ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
[gpfsug-discuss] Exporting remote GPFS mounts on a non-ces SMB share
Hello All, We are thinking of exporting “remote" GPFS mounts on a remote GPFS 5.0 cluster through an SMB share. I have heard in a previous thread that it is not a good idea to export an NFS/SMB share on a remote GPFS mount and make it writable; the issue that could be caused by making it writable is metanode swapping between the GPFS clusters. May I understand this better, and the seriousness of this issue? The possibility of a single file being written at the same time from a GPFS node and an NFS/SMB node is minimal; however, it is possible that a file is written at the same time from multiple protocols by mistake, and we cannot prevent it. This is the setup: GPFS storage cluster: /gpfs01 GPFS CES cluster (does not have any storage): /gpfs01 -> mounted remotely; NFS exports /gpfs01 as part of the CES cluster GPFS client of the CES cluster -> acts as SMB server and exports /gpfs01 over SMB Are there any other limitations that I need to know about for the above setup? We cannot use GPFS CES SMB as of now for a few other reasons, such as LDAP/AD ID mapping and authentication complications. Regards, Lohit ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
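For reference, a minimal smb.conf share stanza for the kind of non-CES export described above might look like the following. This is a sketch only: the share name and path are assumed, and the vfs_gpfs options available vary by Samba version; it is in no way a supported CES configuration.

```ini
; Illustrative only: plain Samba export of a remotely mounted GPFS
; filesystem from a GPFS client node (not a CES-managed share).
[gpfs01]
   path = /gpfs01
   read only = no
   ; vfs_gpfs coordinates Samba share modes and leases with GPFS locking,
   ; which matters precisely because other protocols touch the same files
   vfs objects = gpfs
   gpfs:sharemodes = yes
```

Without vfs_gpfs (or with it misconfigured), Samba's oplocks/share modes are invisible to NFS and native GPFS clients, which is part of why this setup is discouraged.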
Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2
Also, you could just upgrade one of the clients to this version and test whether the hang still occurs. You do not have to upgrade the NSD servers to test. Regards, Lohit On Nov 2, 2018, 12:29 PM -0400, vall...@cbio.mskcc.org, wrote: > Yes, > > We have upgraded to 5.0.1-0.5, which has the patch for the issue. > The related IBM case number was : TS001010674 > > Regards, > Lohit > > On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems > , wrote: > > Hi Damir, > > > > It was related to specific user jobs and mmap (?). We opened PMR with IBM > > and have patch from IBM, since than we don’t see issue. > > > > Regards, > > > > Sveta. > > > > > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > > > > > Hi, > > > > > > Did you ever figure out the root cause of the issue? We have recently > > > (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > > > > > In the last few weeks we have seen an increasing number of ps hangs > > > across compute and login nodes on our cluster. The filesystem version (of > > > all filesystems on our cluster) is: > > > -V 15.01 (4.2.0.0) File system version > > > > > > I am just wondering if anyone has seen this type of issue since you first > > > reported it and if there is a known fix for it. > > > > > > Damir > > > > > > > On Tue, May 22, 2018 at 10:43 AM wrote: > > > > > Hello All, > > > > > > > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a > > > > > month ago. We have not yet converted the 4.2.2.2 filesystem version > > > > > to 5. ( That is we have not run the mmchconfig release=LATEST command) > > > > > Right after the upgrade, we are seeing many “ps hangs" across the > > > > > cluster. 
All the “ps hangs” happen when jobs run related to a Java > > > > > process or many Java threads (example: GATK ) > > > > > The hangs are pretty random, and have no particular pattern except > > > > > that we know that it is related to just Java or some jobs reading > > > > > from directories with about 60 files. > > > > > > > > > > I have raised an IBM critical service request about a month ago > > > > > related to this - PMR: 24090,L6Q,000. > > > > > However, According to the ticket - they seemed to feel that it might > > > > > not be related to GPFS. > > > > > Although, we are sure that these hangs started to appear only after > > > > > we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > > > > > > > > > One of the other reasons we are not able to prove that it is GPFS is > > > > > because, we are unable to capture any logs/traces from GPFS once the > > > > > hang happens. > > > > > Even GPFS trace commands hang, once “ps hangs” and thus it is getting > > > > > difficult to get any dumps from GPFS. > > > > > > > > > > Also - According to the IBM ticket, they seemed to have a seen a “ps > > > > > hang" issue and we have to run mmchconfig release=LATEST command, > > > > > and that will resolve the issue. > > > > > However we are not comfortable making the permanent change to > > > > > Filesystem version 5. and since we don’t see any near solution to > > > > > these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the > > > > > previous state that we know the cluster was stable. > > > > > > > > > > Can downgrading GPFS take us back to exactly the previous GPFS config > > > > > state? > > > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i > > > > > reinstall all rpms to a previous version? or is there anything else > > > > > that i need to make sure with respect to GPFS configuration? 
> > > > > Because i think that GPFS 5.0 might have updated internal default > > > > > GPFS configuration parameters , and i am not sure if downgrading GPFS > > > > > will change them back to what they were in GPFS 4.2.3.2 > > > > > > > > > > Our previous state: > > > > > > > > > > 2 Storage clusters - 4.2.3.2 > > > > > 1 Compute cluster - 4.2.3.2 ( remote mounts the above 2 storage > > > > > clusters ) > > > > > > > > > > Our current state: > > > > > > > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > > > > > 1 Compute cluster - 5.0.0.2 > > > > > > > > > > Do i need to downgrade all the clusters to go to the previous state ? > > > > > or is it ok if we just downgrade the compute cluster to previous > > > > > version? > > > > > > > > > > Any advice on the best steps forward, would greatly help. > > > > > > > > > > Thanks, > > > > > > > > > > Lohit > > > > > ___ > > > > > gpfsug-discuss mailing list > > > > > gpfsug-discuss at spectrumscale.org > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ___ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > ___ > > gpfsug-discuss mailing list > >
Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2
Yes, We have upgraded to 5.0.1-0.5, which has the patch for the issue. The related IBM case number was : TS001010674 Regards, Lohit On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems , wrote: > Hi Damir, > > It was related to specific user jobs and mmap (?). We opened PMR with IBM and > have patch from IBM, since than we don’t see issue. > > Regards, > > Sveta. > > > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > > > Hi, > > > > Did you ever figure out the root cause of the issue? We have recently (end > > of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > > > In the last few weeks we have seen an increasing number of ps hangs across > > compute and login nodes on our cluster. The filesystem version (of all > > filesystems on our cluster) is: > > -V 15.01 (4.2.0.0) File system version > > > > I am just wondering if anyone has seen this type of issue since you first > > reported it and if there is a known fix for it. > > > > Damir > > > > > On Tue, May 22, 2018 at 10:43 AM wrote: > > > > Hello All, > > > > > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a > > > > month ago. We have not yet converted the 4.2.2.2 filesystem version to > > > > 5. ( That is we have not run the mmchconfig release=LATEST command) > > > > Right after the upgrade, we are seeing many “ps hangs" across the > > > > cluster. All the “ps hangs” happen when jobs run related to a Java > > > > process or many Java threads (example: GATK ) > > > > The hangs are pretty random, and have no particular pattern except that > > > > we know that it is related to just Java or some jobs reading from > > > > directories with about 60 files. > > > > > > > > I have raised an IBM critical service request about a month ago related > > > > to this - PMR: 24090,L6Q,000. > > > > However, According to the ticket - they seemed to feel that it might > > > > not be related to GPFS. 
> > > > Although, we are sure that these hangs started to appear only after we > > > > upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > > > > > > > One of the other reasons we are not able to prove that it is GPFS is > > > > because, we are unable to capture any logs/traces from GPFS once the > > > > hang happens. > > > > Even GPFS trace commands hang, once “ps hangs” and thus it is getting > > > > difficult to get any dumps from GPFS. > > > > > > > > Also - According to the IBM ticket, they seemed to have a seen a “ps > > > > hang" issue and we have to run mmchconfig release=LATEST command, and > > > > that will resolve the issue. > > > > However we are not comfortable making the permanent change to > > > > Filesystem version 5. and since we don’t see any near solution to these > > > > hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous > > > > state that we know the cluster was stable. > > > > > > > > Can downgrading GPFS take us back to exactly the previous GPFS config > > > > state? > > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i > > > > reinstall all rpms to a previous version? or is there anything else > > > > that i need to make sure with respect to GPFS configuration? > > > > Because i think that GPFS 5.0 might have updated internal default GPFS > > > > configuration parameters , and i am not sure if downgrading GPFS will > > > > change them back to what they were in GPFS 4.2.3.2 > > > > > > > > Our previous state: > > > > > > > > 2 Storage clusters - 4.2.3.2 > > > > 1 Compute cluster - 4.2.3.2 ( remote mounts the above 2 storage > > > > clusters ) > > > > > > > > Our current state: > > > > > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > > > > 1 Compute cluster - 5.0.0.2 > > > > > > > > Do i need to downgrade all the clusters to go to the previous state ? > > > > or is it ok if we just downgrade the compute cluster to previous > > > > version? 
> > > > > > > > Any advice on the best steps forward, would greatly help. > > > > > > > > Thanks, > > > > > > > > Lohit > > > > ___ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > ___ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
Re: [gpfsug-discuss] GPFS, Pagepool and Block size -> Performance reduces with larger block size
; > > i could think of multiple other scenarios , which is why its so > > > > > > > hard to accurately benchmark an application because you will > > > > > > > design a benchmark to test an application, but it actually almost > > > > > > > always behaves different then you think it does :-) > > > > > > > > > > > > > > so best is to run the real application and see under which > > > > > > > configuration it works best. > > > > > > > > > > > > > > you could also take a trace with trace=io and then look at > > > > > > > > > > > > > > TRACE_VNOP: READ: > > > > > > > TRACE_VNOP: WRITE: > > > > > > > > > > > > > > and compare them to > > > > > > > > > > > > > > TRACE_IO: QIO: read > > > > > > > TRACE_IO: QIO: write > > > > > > > > > > > > > > and see if the numbers summed up for both are somewhat equal. if > > > > > > > TRACE_VNOP is significant smaller than TRACE_IO you most likely > > > > > > > do more i/o than you should and turning prefetching off might > > > > > > > actually make things faster . > > > > > > > > > > > > > > keep in mind i am no longer working for IBM so all i say might be > > > > > > > obsolete by now, i no longer have access to the one and only > > > > > > > truth aka the source code ... but if i am wrong i am sure > > > > > > > somebody will point this out soon ;-) > > > > > > > > > > > > > > sven > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > > > > > > > > > Hello All, > > > > > > > > > > > > > > > > > > This is a continuation to the previous discussion that i had > > > > > > > > > with Sven. > > > > > > > > > However against what i had mentioned previously - i realize > > > > > > > > > that this is “not” related to mmap, and i see it when doing > > > > > > > > > random freads. > > > > > > > > > > > > > > > > > > I see that block-size of the filesystem matters when reading > > > > > > > > > from Page pool. 
> > > > > > > > > I see a major difference in performance when compared 1M to > > > > > > > > > 16M, when doing lot of random small freads with all of the > > > > > > > > > data in pagepool. > > > > > > > > > > > > > > > > > > Performance for 1M is a magnitude “more” than the performance > > > > > > > > > that i see for 16M. > > > > > > > > > > > > > > > > > > The GPFS that we have currently is : > > > > > > > > > Version : 5.0.1-0.5 > > > > > > > > > Filesystem version: 19.01 (5.0.1.0) > > > > > > > > > Block-size : 16M > > > > > > > > > > > > > > > > > > I had made the filesystem block-size to be 16M, thinking that > > > > > > > > > i would get the most performance for both random/sequential > > > > > > > > > reads from 16M than the smaller block-sizes. > > > > > > > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 > > > > > > > > > and thus not loose lot of storage space even with 16M. > > > > > > > > > I had run few benchmarks and i did see that 16M was > > > > > > > > > performing better “when hitting storage/disks” with respect > > > > > > > > > to bandwidth for random/sequential on small/large reads. > > > > > > > > > > > > > > > > > > However, with this particular workload - where it freads a > > > > > > > > > chunk of data randomly from hundreds of files -> I see that > > > > > > > > > the number of page-faults increase with block-size and > > > > > > &
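Sven's suggestion above of summing the VNOP-level and QIO-level trace records could be scripted along these lines. The input below is a toy stand-in for a formatted GPFS trace (trace=io) report; the real record layout and report path will differ.

```shell
# Toy input standing in for a formatted trace report (invented layout).
cat > /tmp/trcrpt.out <<'EOF'
TRACE_VNOP: READ: inode 100
TRACE_VNOP: WRITE: inode 101
TRACE_IO: QIO: read data disk 2
TRACE_IO: QIO: read data disk 3
TRACE_IO: QIO: write data disk 2
EOF
# Count both layers. If QIO is much larger than VNOP, the filesystem is
# reading more than the application consumes, and turning prefetch off
# might help, as Sven suggests.
awk '/TRACE_VNOP: (READ|WRITE):/ { vnop++ }
     /TRACE_IO: QIO: (read|write)/ { qio++ }
     END { printf "vnop=%d qio=%d\n", vnop, qio }' /tmp/trcrpt.out
# prints: vnop=2 qio=3
```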
Re: [gpfsug-discuss] GPFS, Pagepool and Block size -> Performance reduces with larger block size
> the source code ... but if i am wrong i am sure somebody will point > > > > this out soon ;-) > > > > > > > > sven > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > > > > > > Hello All, > > > > > > > > > > > > This is a continuation to the previous discussion that i had with > > > > > > Sven. > > > > > > However against what i had mentioned previously - i realize that > > > > > > this is “not” related to mmap, and i see it when doing random > > > > > > freads. > > > > > > > > > > > > I see that block-size of the filesystem matters when reading from > > > > > > Page pool. > > > > > > I see a major difference in performance when compared 1M to 16M, > > > > > > when doing lot of random small freads with all of the data in > > > > > > pagepool. > > > > > > > > > > > > Performance for 1M is a magnitude “more” than the performance that > > > > > > i see for 16M. > > > > > > > > > > > > The GPFS that we have currently is : > > > > > > Version : 5.0.1-0.5 > > > > > > Filesystem version: 19.01 (5.0.1.0) > > > > > > Block-size : 16M > > > > > > > > > > > > I had made the filesystem block-size to be 16M, thinking that i > > > > > > would get the most performance for both random/sequential reads > > > > > > from 16M than the smaller block-sizes. > > > > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and > > > > > > thus not loose lot of storage space even with 16M. > > > > > > I had run few benchmarks and i did see that 16M was performing > > > > > > better “when hitting storage/disks” with respect to bandwidth for > > > > > > random/sequential on small/large reads. > > > > > > > > > > > > However, with this particular workload - where it freads a chunk of > > > > > > data randomly from hundreds of files -> I see that the number of > > > > > > page-faults increase with block-size and actually reduce the > > > > > > performance. 
> > > > > > 1M performs a lot better than 16M, and may be i will get better > > > > > > performance with less than 1M. > > > > > > It gives the best performance when reading from local disk, with 4K > > > > > > block size filesystem. > > > > > > > > > > > > What i mean by performance when it comes to this workload - is not > > > > > > the bandwidth but the amount of time that it takes to do each > > > > > > iteration/read batch of data. > > > > > > > > > > > > I figure what is happening is: > > > > > > fread is trying to read a full block size of 16M - which is good in > > > > > > a way, when it hits the hard disk. > > > > > > But the application could be using just a small part of that 16M. > > > > > > Thus when randomly reading(freads) lot of data of 16M chunk size - > > > > > > it is page faulting a lot more and causing the performance to drop . > > > > > > I could try to make the application do read instead of freads, but > > > > > > i fear that could be bad too since it might be hitting the disk > > > > > > with a very small block size and that is not good. > > > > > > > > > > > > With the way i see things now - > > > > > > I believe it could be best if the application does random reads of > > > > > > 4k/1M from pagepool but some how does 16M from rotating disks. > > > > > > > > > > > > I don’t see any way of doing the above other than following a > > > > > > different approach where i create a filesystem with a smaller block > > > > > > size ( 1M or less than 1M ), on SSDs as a tier. > > > > > > > > > > > > May i please ask for advise, if what i am understanding/seeing is > > > > > > right and the best solution possible for the above scenario. > > > > > > > > > > > > Regards, > > > > > > Lohit > > > > > > &g
Re: [gpfsug-discuss] GPFS, Pagepool and Block size -> Performance reduces with larger block size
with less than 1M. > > > It gives the best performance when reading from local disk, with 4K block > > > size filesystem. > > > > > > What i mean by performance when it comes to this workload - is not the > > > bandwidth but the amount of time that it takes to do each iteration/read > > > batch of data. > > > > > > I figure what is happening is: > > > fread is trying to read a full block size of 16M - which is good in a > > > way, when it hits the hard disk. > > > But the application could be using just a small part of that 16M. Thus > > > when randomly reading(freads) lot of data of 16M chunk size - it is page > > > faulting a lot more and causing the performance to drop . > > > I could try to make the application do read instead of freads, but i fear > > > that could be bad too since it might be hitting the disk with a very > > > small block size and that is not good. > > > > > > With the way i see things now - > > > I believe it could be best if the application does random reads of 4k/1M > > > from pagepool but some how does 16M from rotating disks. > > > > > > I don’t see any way of doing the above other than following a different > > > approach where i create a filesystem with a smaller block size ( 1M or > > > less than 1M ), on SSDs as a tier. > > > > > > May i please ask for advise, if what i am understanding/seeing is right > > > and the best solution possible for the above scenario. > > > > > > Regards, > > > Lohit > > > > > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , > > > wrote: > > > > Hey Sven, > > > > > > > > This is regarding mmap issues and GPFS. > > > > We had discussed previously of experimenting with GPFS 5. > > > > > > > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > > > > > > > I am yet to experiment with mmap performance, but before that - I am > > > > seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > > > > > > > Have you seen GPFS ever hang on this syscall? 
> > > > [Tue Apr 10 04:20:13 2018] [] > > > > _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > > > > > > > I see the above ,when kernel hangs and throws out a series of trace > > > > calls. > > > > > > > > I somehow think the above trace is related to processes hanging on GPFS > > > > forever. There are no errors in GPFS however. > > > > > > > > Also, I think the above happens only when the mmap threads go above a > > > > particular number. > > > > > > > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to > > > > 4.2.3.2 . At that time , the issue happened when mmap threads go more > > > > than worker1threads. According to the ticket - it was a mmap race > > > > condition that GPFS was not handling well. > > > > > > > > I am not sure if this issue is a repeat and I am yet to isolate the > > > > incident and test with increasing number of mmap threads. > > > > > > > > I am not 100 percent sure if this is related to mmap yet but just > > > > wanted to ask you if you have seen anything like above. > > > > > > > > Thanks, > > > > > > > > Lohit > > > > > > > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > > > > Hi Lohit, > > > > > > > > > > i am working with ray on a mmap performance improvement right now, > > > > > which most likely has the same root cause as yours , see --> > > > > > http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > > > > the thread above is silent after a couple of back and rorth, but ray > > > > > and i have active communication in the background and will repost as > > > > > soon as there is something new to share. > > > > > i am happy to look at this issue after we finish with ray's workload > > > > > if there is something missing, but first let's finish his, get you > > > > > try the same fix and see if there is something missing. > > > > > > > > > > btw. 
if people would share their use of MMAP , what applications they > > > > > use (home grown, just use lmdb which uses mmap under the cover, etc) > > > > >
Re: [gpfsug-discuss] GPFS, Pagepool and Block size -> Performance reduces with larger block size
Hello All, This is a continuation of the previous discussion that I had with Sven. However, against what I had mentioned previously, I realize that this is “not” related to mmap; I see it when doing random freads. I see that the block size of the filesystem matters when reading from the page pool. I see a major difference in performance between 1M and 16M when doing a lot of random small freads with all of the data in the pagepool. Performance for 1M is a magnitude “more” than the performance that I see for 16M. The GPFS that we have currently is : Version : 5.0.1-0.5 Filesystem version: 19.01 (5.0.1.0) Block-size : 16M I had made the filesystem block size 16M, thinking that I would get the most performance for both random and sequential reads with 16M rather than the smaller block sizes. With GPFS 5.0, I made use of the 1024 sub-blocks instead of 32 and thus do not lose a lot of storage space even with 16M. I had run a few benchmarks and I did see that 16M was performing better “when hitting storage/disks” with respect to bandwidth for random/sequential on small/large reads. However, with this particular workload, where it freads a chunk of data randomly from hundreds of files, I see that the number of page faults increases with block size and actually reduces the performance. 
I could try to make the application use read() instead of fread(), but I fear that could be bad too, since it might then hit the disk with a very small I/O size, which is not good. The way I see things now, it would be best if the application could do random 4K/1M reads from the pagepool while GPFS still reads 16M from the rotating disks. I don’t see any way of doing that other than a different approach: create a filesystem with a smaller block size (1M or less than 1M) on SSDs as a tier.

May I ask for advice on whether what I am understanding/seeing is right, and on the best possible solution for the above scenario?

Regards,
Lohit

On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru, wrote:
> Hey Sven,
>
> This is regarding mmap issues and GPFS.
> We had discussed previously of experimenting with GPFS 5.
>
> I now have upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2
>
> I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap.
>
> Have you seen GPFS ever hang on this syscall?
> [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
>
> I see the above when the kernel hangs and throws out a series of trace calls.
>
> I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however.
>
> Also, I think the above happens only when the mmap threads go above a particular number.
>
> We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2. At that time, the issue happened when mmap threads went above worker1threads. According to the ticket - it was a mmap race condition that GPFS was not handling well.
>
> I am not sure if this issue is a repeat and I am yet to isolate the incident and test with increasing number of mmap threads.
> > I am not 100 percent sure if this is related to mmap yet but just wanted to > ask you if you have seen anything like above. > > Thanks, > > Lohit > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > Hi Lohit, > > > > i am working with ray on a mmap performance improvement right now, which > > most likely has the same root cause as yours , see --> > > http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > the thread above is silent after a couple of back and rorth, but ray and i > > have active communication in the background and will repost as soon as > > there is something new to share. > > i am happy to look at this issue after we finish with ray's workload if > > there is something missing, but first let's finish his, get you try the > > same fix and see if there is something missing. > > > > btw. if people would share their use of MMAP , what applicatio
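The read amplification described earlier in this thread can be illustrated in user space with ordinary buffered I/O. This is only an analogy for the pagepool/block-size effect, not GPFS-specific code; the scratch file name and sizes are made up, and the `buffering` parameter stands in for the filesystem block size:

```python
import os
import random

def random_freads(path, io_size, buffering, n=64, seed=1):
    """Do n random reads of io_size bytes through a userspace buffer.

    A buffered reader pulls data in buffer-sized chunks even when the
    application only consumes io_size bytes of each chunk - the same
    shape of amplification as a large filesystem block feeding freads.
    """
    size = os.path.getsize(path)
    rng = random.Random(seed)          # fixed seed -> identical offsets per run
    chunks = []
    with open(path, "rb", buffering=buffering) as f:
        for _ in range(n):
            f.seek(rng.randrange(0, size - io_size))
            chunks.append(f.read(io_size))
    return chunks

# Scratch file standing in for one of the hundreds of input files.
path = "scratch.bin"
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of random data

small_buf = random_freads(path, 4096, buffering=4096)     # ~4 KiB pulled per miss
large_buf = random_freads(path, 4096, buffering=1 << 20)  # up to 1 MiB pulled per miss

# The application sees identical data either way; only the amount of data
# dragged in behind each logical 4 KiB read differs.
assert small_buf == large_buf
os.remove(path)
```

In C the analogous knob would be setvbuf() on the stdio stream, which bounds how much fread prefetches per miss.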
Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2
Thanks, Dwayne. I don’t think we are facing anything else from a network perspective as of now. We were seeing deadlocks initially when we upgraded to 5.0, but they might not have been caused by the network. We also see deadlocks now, but they are mostly caused by high waiters, I believe. I have temporarily disabled the automated deadlock detection. Thanks, Lohit On May 22, 2018, 12:54 PM -0400, dwayne.h...@med.mun.ca, wrote: > We are having issues with ESS/Mellanox implementation and were curious as to > what you were working with from a network perspective. > > Best, > Dwayne > — > Dwayne Hart | Systems Administrator IV > > CHIA, Faculty of Medicine > Memorial University of Newfoundland > 300 Prince Philip Drive > St. John’s, Newfoundland | A1B 3V6 > Craig L Dobbin Building | 4M409 > T 709 864 6631 > > On May 22, 2018, at 2:10 PM, "vall...@cbio.mskcc.org" >wrote: > > > 10G Ethernet. > > > > Thanks, > > Lohit > > > > On May 22, 2018, 11:55 AM -0400, dwayne.h...@med.mun.ca, wrote: > > > Hi Lohit, > > > > > > What type of network are you using on the back end to transfer the GPFS > > > traffic? > > > > > > Best, > > > Dwayne > > > > > > From: gpfsug-discuss-boun...@spectrumscale.org > > > [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of > > > vall...@cbio.mskcc.org > > > Sent: Tuesday, May 22, 2018 1:13 PM > > > To: gpfsug main discussion list > > > Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading > > > from GPFS 5.0.0-2 to GPFS 4.2.3.2 > > > > > > Hello All, > > > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month > > > ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( > > > That is we have not run the mmchconfig release=LATEST command) > > > Right after the upgrade, we are seeing many “ps hangs" across the > > > cluster. 
All the “ps hangs” happen when jobs run related to a Java > > > process or many Java threads (example: GATK ) > > > The hangs are pretty random, and have no particular pattern except that > > > we know that it is related to just Java or some jobs reading from > > > directories with about 60 files. > > > > > > I have raised an IBM critical service request about a month ago related > > > to this - PMR: 24090,L6Q,000. > > > However, According to the ticket - they seemed to feel that it might not > > > be related to GPFS. > > > Although, we are sure that these hangs started to appear only after we > > > upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > > > > > One of the other reasons we are not able to prove that it is GPFS is > > > because, we are unable to capture any logs/traces from GPFS once the hang > > > happens. > > > Even GPFS trace commands hang, once “ps hangs” and thus it is getting > > > difficult to get any dumps from GPFS. > > > > > > Also - According to the IBM ticket, they seemed to have a seen a “ps > > > hang" issue and we have to run mmchconfig release=LATEST command, and > > > that will resolve the issue. > > > However we are not comfortable making the permanent change to Filesystem > > > version 5. and since we don’t see any near solution to these hangs - we > > > are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we > > > know the cluster was stable. > > > > > > Can downgrading GPFS take us back to exactly the previous GPFS config > > > state? > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i > > > reinstall all rpms to a previous version? or is there anything else that > > > i need to make sure with respect to GPFS configuration? 
> > > Because i think that GPFS 5.0 might have updated internal default GPFS > > > configuration parameters , and i am not sure if downgrading GPFS will > > > change them back to what they were in GPFS 4.2.3.2 > > > > > > Our previous state: > > > > > > 2 Storage clusters - 4.2.3.2 > > > 1 Compute cluster - 4.2.3.2 ( remote mounts the above 2 storage clusters > > > ) > > > > > > Our current state: > > > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > > > 1 Compute cluster - 5.0.0.2 > > > > > > Do i need to downgrade all the clusters to go to the previous state ? or > > > is it ok if we just downgrade the compute cluster to previous version? > > > > > > Any advice on the best steps forward, would greatly help. > > > > > > Thanks, > > > > > > Lohit > > > ___ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > ___ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___
Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2
10G Ethernet. Thanks, Lohit On May 22, 2018, 11:55 AM -0400, dwayne.h...@med.mun.ca, wrote: > Hi Lohit, > > What type of network are you using on the back end to transfer the GPFS > traffic? > > Best, > Dwayne > > From: gpfsug-discuss-boun...@spectrumscale.org > [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of > vall...@cbio.mskcc.org > Sent: Tuesday, May 22, 2018 1:13 PM > To: gpfsug main discussion list> Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading > from GPFS 5.0.0-2 to GPFS 4.2.3.2 > > Hello All, > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month > ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is > we have not run the mmchconfig release=LATEST command) > Right after the upgrade, we are seeing many “ps hangs" across the cluster. > All the “ps hangs” happen when jobs run related to a Java process or many > Java threads (example: GATK ) > The hangs are pretty random, and have no particular pattern except that we > know that it is related to just Java or some jobs reading from directories > with about 60 files. > > I have raised an IBM critical service request about a month ago related to > this - PMR: 24090,L6Q,000. > However, According to the ticket - they seemed to feel that it might not be > related to GPFS. > Although, we are sure that these hangs started to appear only after we > upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > One of the other reasons we are not able to prove that it is GPFS is because, > we are unable to capture any logs/traces from GPFS once the hang happens. > Even GPFS trace commands hang, once “ps hangs” and thus it is getting > difficult to get any dumps from GPFS. > > Also - According to the IBM ticket, they seemed to have a seen a “ps hang" > issue and we have to run mmchconfig release=LATEST command, and that will > resolve the issue. > However we are not comfortable making the permanent change to Filesystem > version 5. 
and since we don’t see any near solution to these hangs - we are > thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know > the cluster was stable. > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall > all rpms to a previous version? or is there anything else that i need to make > sure with respect to GPFS configuration? > Because i think that GPFS 5.0 might have updated internal default GPFS > configuration parameters , and i am not sure if downgrading GPFS will change > them back to what they were in GPFS 4.2.3.2 > > Our previous state: > > 2 Storage clusters - 4.2.3.2 > 1 Compute cluster - 4.2.3.2 ( remote mounts the above 2 storage clusters ) > > Our current state: > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > 1 Compute cluster - 5.0.0.2 > > Do i need to downgrade all the clusters to go to the previous state ? or is > it ok if we just downgrade the compute cluster to previous version? > > Any advice on the best steps forward, would greatly help. > > Thanks, > > Lohit > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
[gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2
Hello All,

We recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2, about a month ago. We have not yet converted the 4.2.2.2 filesystem version to 5 (that is, we have not run the mmchconfig release=LATEST command). Right after the upgrade, we are seeing many “ps hangs” across the cluster. All the “ps hangs” happen when jobs run related to a Java process or many Java threads (example: GATK). The hangs are pretty random, with no particular pattern except that we know they are related to Java, or to jobs reading from directories with about 60 files.

I raised an IBM critical service request about a month ago related to this - PMR: 24090,L6Q,000. However, according to the ticket, they seem to feel that it might not be related to GPFS, although we are sure that these hangs started to appear only after we upgraded from 4.2.3.2 to GPFS 5.0.0.2.

One of the reasons we are not able to prove that it is GPFS is that we are unable to capture any logs/traces from GPFS once the hang happens. Even GPFS trace commands hang once “ps hangs”, so it is difficult to get any dumps from GPFS.

Also, according to the IBM ticket, they seem to have seen a “ps hang” issue before, and say that running the mmchconfig release=LATEST command will resolve it. However, we are not comfortable making the permanent change to filesystem version 5, and since we don’t see any near-term solution to these hangs, we are thinking of downgrading to GPFS 4.2.3.2 - the previous state in which we know the cluster was stable.

Can downgrading GPFS take us back to exactly the previous GPFS config state? With respect to downgrading from 5 to 4.2.3.2: is it just a matter of reinstalling all RPMs at the previous version, or is there anything else I need to take care of with respect to the GPFS configuration?
I think GPFS 5.0 might have updated internal default GPFS configuration parameters, and I am not sure whether downgrading GPFS will change them back to what they were in GPFS 4.2.3.2.

Our previous state:

2 Storage clusters - 4.2.3.2
1 Compute cluster - 4.2.3.2 (remote mounts the above 2 storage clusters)

Our current state:

2 Storage clusters - 5.0.0.2 (filesystem version - 4.2.2.2)
1 Compute cluster - 5.0.0.2

Do I need to downgrade all the clusters to go back to the previous state, or is it OK if we just downgrade the compute cluster to the previous version? Any advice on the best steps forward would greatly help.

Thanks,

Lohit
Re: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks
Thank you for the detailed answer Andrew. I do understand that anything above the posix level will not be supported by IBM and might lead to scaling/other issues. We will start small, and discuss with IBM representative on any other possible efforts. Regards, Lohit On May 15, 2018, 10:39 PM -0400, Andrew Beattie <abeat...@au1.ibm.com>, wrote: > Lohit, > > There is no technical reason why if you use the correct licensing that you > can't publish a Posix fileystem using external Protocol tool rather than CES > the key thing to note is that if its not the IBM certified solution that IBM > support stops at the Posix level and the protocol issues are your own to > resolve. > > The reason we provide the CES environment is to provide a supported > architecture to deliver protocol access, does it have some limitations - > certainly > but it is a supported environment. Moving away from this moves the risk onto > the customer to resolve and maintain. > > The other part of this, and potentially the reason why you might have been > warned off using an external solution is that not all systems provide > scalability and resiliency > so you may end up bumping into scaling issues by building your own > environment --- and from the sound of things this is a large complex > environment. These issues are clearly defined in the CES stack and are well > understood. moving away from this will move you into the realm of the > unknown -- again the risk becomes yours. > > it may well be worth putting a request in with your local IBM representative > to have IBM Scale protocol development team involved in your design and see > what we can support for your requirements. 
> > > Regards, > Andrew Beattie > Software Defined Storage - IT Specialist > Phone: 614-2133-7927 > E-mail: abeat...@au1.ibm.com > > > > - Original message - > > From: vall...@cbio.mskcc.org > > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org> > > Cc: > > Subject: Re: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks > > Date: Wed, May 16, 2018 12:25 PM > > > > Thanks Stephen, > > > > Yes i do acknowledge, that it will need a SERVER license and thank you for > > reminding me. > > > > I just wanted to make sure, from the technical point of view that we won’t > > face any issues by exporting a GPFS mount as a SMB export. > > > > I remember, i had seen in documentation about few years ago that it is not > > recommended to export a GPFS mount via Third party SMB services (not CES). > > But i don’t exactly remember why. > > > > Regards, > > Lohit > > > > On May 15, 2018, 10:19 PM -0400, Stephen Ulmer <ul...@ulmer.org>, wrote: > > > Lohit, > > > > > > Just be aware that exporting the data from GPFS via SMB requires a SERVER > > > license for the node in question. You’ve mentioned client a few times > > > now. :) > > > > > > -- > > > Stephen > > > > > > > > > > > > > On May 15, 2018, at 6:48 PM, Lohit Valleru <vall...@cbio.mskcc.org> > > > > wrote: > > > > > > > > Thanks Christof. > > > > > > > > The usecase is just that : it is easier to have symlinks of files/dirs > > > > from various locations/filesystems rather than copying or duplicating > > > > that data. > > > > > > > > The design from many years was maintaining about 8 PB of NFS filesystem > > > > with thousands of symlinks to various locations and the same > > > > directories being exported on SMB. > > > > > > > > Now we are migrating most of the data to GPFS keeping the symlinks as > > > > they are. > > > > Thus the need to follow symlinks from the GPFS filesystem to the NFS > > > > Filesystem. 
> > > > The client wants to effectively use the symlinks design that works when > > > > used on Linux but is not happy to hear that he will have to redo years > > > > of work just because GPFS does not support the same. > > > > > > > > I understand that there might be a reason on why CES might not support > > > > this, but is it an issue if we run SMB server on the GPFS clients to > > > > expose a read only or read write GPFS mounts? > > > > > > > > Regards, > > > > > > > > Lohit > > > > > > > > On May 15, 2018, 6:32 PM -0400, Christof Schm
Re: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks
Thanks, Stephen. Yes, I do acknowledge that it will need a SERVER license - thank you for reminding me. I just wanted to make sure, from the technical point of view, that we won’t face any issues by exporting a GPFS mount as an SMB export. I remember seeing in the documentation a few years ago that it is not recommended to export a GPFS mount via third-party SMB services (not CES), but I don’t exactly remember why. Regards, Lohit On May 15, 2018, 10:19 PM -0400, Stephen Ulmer <ul...@ulmer.org>, wrote: > Lohit, > > Just be aware that exporting the data from GPFS via SMB requires a SERVER > license for the node in question. You’ve mentioned client a few times now. :) > > -- > Stephen > > > > > On May 15, 2018, at 6:48 PM, Lohit Valleru <vall...@cbio.mskcc.org> wrote: > > > > Thanks Christof. > > > > The usecase is just that : it is easier to have symlinks of files/dirs from > > various locations/filesystems rather than copying or duplicating that data. > > > > The design from many years was maintaining about 8 PB of NFS filesystem > > with thousands of symlinks to various locations and the same directories > > being exported on SMB. > > > > Now we are migrating most of the data to GPFS keeping the symlinks as they > > are. > > Thus the need to follow symlinks from the GPFS filesystem to the NFS > > Filesystem. > > The client wants to effectively use the symlinks design that works when > > used on Linux but is not happy to hear that he will have to redo years of > > work just because GPFS does not support the same. > > > > I understand that there might be a reason on why CES might not support > > this, but is it an issue if we run SMB server on the GPFS clients to expose > > a read only or read write GPFS mounts? > > > > Regards, > > > > Lohit > > > > On May 15, 2018, 6:32 PM -0400, Christof Schmitt > > <christof.schm...@us.ibm.com>, wrote: > > > > I could use CES, but CES does not support follow-symlinks outside > > > > respective SMB export. 
> > > > > > Samba has the 'wide links' option, that we currently do not test and > > > support as part of the mmsmb integration. You can always open a RFE and > > > ask that we support this option in a future release. > > > > > > > Follow-symlinks is a however a hard-requirement for to follow links > > > > outside GPFS filesystems. > > > > > > I might be reading this wrong, but do you actually want symlinks that > > > point to a file or directory outside of the GPFS file system? Could you > > > outline a usecase for that? > > > > > > Regards, > > > > > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > > > christof.schm...@us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > > > > > > > - Original message - > > > > From: vall...@cbio.mskcc.org > > > > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > > > To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org> > > > > Cc: > > > > Subject: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks > > > > Date: Tue, May 15, 2018 3:04 PM > > > > > > > > Hello All, > > > > > > > > Has anyone tried serving SMB export of GPFS mounts from a SMB server on > > > > GPFS client? Is it supported and does it lead to any issues? > > > > I understand that i will not need a redundant SMB server configuration. > > > > > > > > I could use CES, but CES does not support follow-symlinks outside > > > > respective SMB export. Follow-symlinks is a however a hard-requirement > > > > for to follow links outside GPFS filesystems. 
> > > > > > > > Thanks, > > > > Lohit > > > > > > > > > > > > ___ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > ___ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > ___ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
Re: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks
Thanks Christof. The usecase is just that : it is easier to have symlinks of files/dirs from various locations/filesystems rather than copying or duplicating that data. The design from many years was maintaining about 8 PB of NFS filesystem with thousands of symlinks to various locations and the same directories being exported on SMB. Now we are migrating most of the data to GPFS keeping the symlinks as they are. Thus the need to follow symlinks from the GPFS filesystem to the NFS Filesystem. The client wants to effectively use the symlinks design that works when used on Linux but is not happy to hear that he will have to redo years of work just because GPFS does not support the same. I understand that there might be a reason on why CES might not support this, but is it an issue if we run SMB server on the GPFS clients to expose a read only or read write GPFS mounts? Regards, Lohit On May 15, 2018, 6:32 PM -0400, Christof Schmitt, wrote: > > I could use CES, but CES does not support follow-symlinks outside > > respective SMB export. > > Samba has the 'wide links' option, that we currently do not test and support > as part of the mmsmb integration. You can always open a RFE and ask that we > support this option in a future release. > > > Follow-symlinks is a however a hard-requirement for to follow links > > outside GPFS filesystems. > > I might be reading this wrong, but do you actually want symlinks that point > to a file or directory outside of the GPFS file system? Could you outline a > usecase for that? 
> > Regards, > > Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ > christof.schm...@us.ibm.com || +1-520-799-2469 (T/L: 321-2469) > > > > - Original message - > > From: vall...@cbio.mskcc.org > > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > To: gpfsug main discussion list > > Cc: > > Subject: [gpfsug-discuss] SMB server on GPFS clients and Followsymlinks > > Date: Tue, May 15, 2018 3:04 PM > > > > Hello All, > > > > Has anyone tried serving SMB export of GPFS mounts from a SMB server on > > GPFS client? Is it supported and does it lead to any issues? > > I understand that i will not need a redundant SMB server configuration. > > > > I could use CES, but CES does not support follow-symlinks outside > > respective SMB export. Follow-symlinks is a however a hard-requirement for > > to follow links outside GPFS filesystems. > > > > Thanks, > > Lohit > > > > > > ___ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
[gpfsug-discuss] SMB server on GPFS clients and Followsymlinks
Hello All,

Has anyone tried serving an SMB export of GPFS mounts from an SMB server on a GPFS client? Is it supported, and does it lead to any issues? I understand that I will not need a redundant SMB server configuration.

I could use CES, but CES does not support following symlinks outside the respective SMB export. Following symlinks is, however, a hard requirement for us, in order to follow links outside GPFS filesystems.

Thanks,
Lohit
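For reference, on a plain (non-CES) Samba server the symlink behaviour discussed in this thread is governed by a handful of smb.conf parameters, including the 'wide links' option Christof mentions. The fragment below is a generic, hedged sketch - the share name and path are made up, it is not a supported CES/mmsmb configuration, and note that 'wide links' has security implications, since it lets clients traverse out of the share:

```ini
[global]
   ; 'wide links' is ignored unless unix extensions are disabled
   unix extensions = no

[labshare]
   ; hypothetical share exporting a GPFS mount from a client node
   path = /gpfs/fs0/labshare
   read only = no
   follow symlinks = yes
   wide links = yes
```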
[gpfsug-discuss] Spectrum Scale CES , SAMBA and AD keytab integration with userdefined authentication
Hello All,

I am trying to export a single remote filesystem over NFS/SMB using GPFS CES (GPFS 5.0.0.2 and CentOS 7). We need the NFS exports to be accessible on client nodes that use public-key authentication and LDAP authorization. I already have this working with a previous CES setup on user-defined authentication, where users can just log in to the client nodes and access the NFS mounts. However, I will also need SAMBA exports of the same GPFS filesystem with AD/Kerberos authentication.

Previously, we had a working SAMBA export of a local filesystem using SSSD and AD integration with SAMBA, as described in the solution below from Red Hat:
https://access.redhat.com/solutions/2221561
We find the above a cleaner solution for AD and Samba integration compared to Centrify or winbind.

I understand that GPFS does offer AD authentication; however, I believe I cannot use it, since NFS will need user-defined authentication while SAMBA will need AD authentication. I have therefore been trying user-defined authentication. I tried to edit smb.conf from GPFS (with a bit of help from this blog, written by Simon: https://www.roamingzebra.co.uk/2015/07/smb-protocol-support-with-spectrum.html)

/usr/lpp/mmfs/bin/net conf list

realm =
workgroup =
security = ads
kerberos method = secrets and keytab
idmap config * : backend = tdb
template homedir = /home/%U
dedicated keytab file = /etc/krb5.keytab

I had joined the node to AD with realmd, and I do get the relevant AD info when I try:

/usr/lpp/mmfs/bin/net ads info

However, when I try to display the keytab or add principals to it, it just does not work:

/usr/lpp/mmfs/bin/net ads keytab list -> does not show the keys present in /etc/krb5.keytab
/usr/lpp/mmfs/bin/net ads keytab add cifs -> does not add the keys to /etc/krb5.keytab

As per the Samba documentation, these two parameters should let Samba automatically find the keytab file:
kerberos method = secrets and keytab
dedicated keytab file = /etc/krb5.keytab

I have not yet tried to see whether a SAMBA export works with AD authentication, but I am afraid it might not. Has anyone tried this AD integration with SSSD/SAMBA for GPFS? Any suggestions on how to debug the above would be really helpful.

Thanks,
Lohit
Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
Thanks Simon. Currently, we are thinking of using the same remote filesystem for both NFS/SMB exports. I do have a related question with respect to SMB and AD integration on user-defined authentication. I have seen a past discussion from you on the usergroup regarding a similar integration, but i am trying a different setup. Will send an email with the related subject. Thanks, Lohit On May 3, 2018, 1:30 PM -0400, Simon Thompson (IT Research Support), wrote: > Yes we do this when we really really need to take a remote FS offline, which > we try at all costs to avoid unless we have a maintenance window. > > Note if you only export via SMB, then you don’t have the same effect (unless > something has changed recently) > > Simon > > From: on behalf of > "vall...@cbio.mskcc.org" > Reply-To: "gpfsug-discuss@spectrumscale.org" > > Date: Thursday, 3 May 2018 at 15:41 > To: "gpfsug-discuss@spectrumscale.org" > Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts > > Thanks Mathiaz, > Yes i do understand the concern, that if one of the remote file systems go > down abruptly - the others will go down too. > > However, i suppose we could bring down one of the filesystems before a > planned downtime? > For example, by unexporting the filesystems on NFS/SMB before the downtime? > > I might not want to be in a situation, where i have to bring down all the > remote filesystems because of planned downtime of one of the remote clusters. > > Regards, > Lohit > > On May 3, 2018, 7:41 AM -0400, Mathias Dietz , wrote: > > > Hi Lohit, > > > > >I am thinking of using a single CES protocol cluster, with remote mounts > > >from 3 storage clusters. > > Technically this should work fine (assuming all 3 clusters use the same > > uids/guids). However this has not been tested in our Test lab. > > > > > > >One thing to watch, be careful if your CES root is on a remote fs, as if > > >that goes away, so do all CES exports. 
> > Not only the ces root file system is a concern, the whole CES cluster will > > go down if any remote file systems with NFS exports is not available. > > e.g. if remote cluster 1 is not available, the CES cluster will unmount the > > corresponding file system which will lead to a NFS failure on all CES nodes. > > > > > > Mit freundlichen Grüßen / Kind regards > > > > Mathias Dietz > > > > Spectrum Scale Development - Release Lead Architect (4.2.x) > > Spectrum Scale RAS Architect > > --- > > IBM Deutschland > > Am Weiher 24 > > 65451 Kelsterbach > > Phone: +49 70342744105 > > Mobile: +49-15152801035 > > E-Mail: mdi...@de.ibm.com > > - > > IBM Deutschland Research & Development GmbH > > Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk > > WittkoppSitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht > > Stuttgart, HRB 243294 > > > > > > > > From: vall...@cbio.mskcc.org > > To: gpfsug main discussion list > > Date: 01/05/2018 16:34 > > Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file > > system mounts > > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > > > > > > > Thanks Simon. > > I will make sure i am careful about the CES root and test nfs exporting > > more than 2 remote file systems. > > > > Regards, > > Lohit > > > > On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support) > > , wrote: > > You have been able to do this for some time, though I think it's only just > > supported. > > > > We've been exporting remote mounts since CES was added. > > > > At some point we've had two storage clusters supplying data and at least 3 > > remote file-systems exported over NFS and SMB. > > > > One thing to watch, be careful if your CES root is on a remote fs, as if > > that goes away, so do all CES exports. We do have CES root on a remote fs > > and it works, just be aware... 
> > > > Simon > > > > From: gpfsug-discuss-boun...@spectrumscale.org > > [gpfsug-discuss-boun...@spectrumscale.org] on behalf of > > vall...@cbio.mskcc.org [vall...@cbio.mskcc.org] > > Sent: 30 April 2018 22:11 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts > > > > Hello All, > > > > I read from the below link, that it is now possible to export remote mounts > > over NFS/SMB. > > > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm > > > > I am thinking of using a single CES protocol cluster, with remote mounts > > from 3 storage
Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
Thanks Bryan. Yes i do understand it now, with respect to multi clusters reading the same file and metanode flapping. Will make sure the workload design will prevent metanode flapping. Regards, Lohit On May 3, 2018, 11:15 AM -0400, Bryan Banister, wrote: > Hi Lohit, > > Please see slides 13 and 14 in the presentation that DDN gave at the GPFS UG > in the UK this April: > http://files.gpfsug.org/presentations/2018/London/2-5_GPFSUG_London_2018_VCC_DDN_Overheads.pdf > > Multicluster setups with shared file access have a high probability of > “MetaNode Flapping” > • “MetaNode role transfer occurs when the same files from a filesystem are > accessed from two or more “client” clusters via a MultiCluster relationship.” > > Cheers, > -Bryan > > From: gpfsug-discuss-boun...@spectrumscale.org > [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of > vall...@cbio.mskcc.org > Sent: Thursday, May 03, 2018 9:46 AM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts > > Note: External Email > Thanks Brian, > May i know, if you could explain a bit more on the metadata updates issue? > I am not sure i exactly understand on why the metadata updates would fail > between filesystems/between clusters - since every remote cluster will have > its own metadata pool/servers. > I suppose the metadata updates for respective remote filesystems should go to > respective remote clusters/metadata servers and should not depend on metadata > servers of other remote clusters? > Please do correct me if i am wrong. > As of now, our workload is to use NFS/SMB to read files and update files from > different remote servers. It is not for running heavy parallel read/write > workloads across different servers. 
> > Thanks, > Lohit > > On May 3, 2018, 10:25 AM -0400, Bryan Banister , > wrote: > > > Hi Lohit, > > > > Just another thought, you also have to consider that metadata updates will > > have to fail between nodes in the CES cluster with those in other clusters > > because nodes in separate remote clusters do not communicate directly for > > metadata updates, which depends on your workload is that would be an issue. > > > > Cheers, > > -Bryan > > > > From: gpfsug-discuss-boun...@spectrumscale.org > > [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Mathias Dietz > > Sent: Thursday, May 03, 2018 6:41 AM > > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file system > > mounts > > > > Note: External Email > > Hi Lohit, > > > > >I am thinking of using a single CES protocol cluster, with remote mounts > > >from 3 storage clusters. > > Technically this should work fine (assuming all 3 clusters use the same > > uids/guids). However this has not been tested in our Test lab. > > > > > > >One thing to watch, be careful if your CES root is on a remote fs, as if > > >that goes away, so do all CES exports. > > Not only the ces root file system is a concern, the whole CES cluster will > > go down if any remote file systems with NFS exports is not available. > > e.g. if remote cluster 1 is not available, the CES cluster will unmount the > > corresponding file system which will lead to a NFS failure on all CES nodes. 
> > > > > > Mit freundlichen Grüßen / Kind regards > > > > Mathias Dietz > > > > Spectrum Scale Development - Release Lead Architect (4.2.x) > > Spectrum Scale RAS Architect > > --- > > IBM Deutschland > > Am Weiher 24 > > 65451 Kelsterbach > > Phone: +49 70342744105 > > Mobile: +49-15152801035 > > E-Mail: mdi...@de.ibm.com > > - > > IBM Deutschland Research & Development GmbH > > Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk > > WittkoppSitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht > > Stuttgart, HRB 243294 > > > > > > > > From: vall...@cbio.mskcc.org > > To: gpfsug main discussion list > > Date: 01/05/2018 16:34 > > Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file > > system mounts > > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > > > > > > > Thanks Simon. > > I will make sure i am careful about the CES root and test nfs exporting > > more than 2 remote file systems. > > > > Regards, > > Lohit > > > > On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support) > > , wrote: > > You have been able to do this for some time, though I think it's only just > > supported. > > > > We've been exporting remote mounts since CES was added. > > > > At some point we've had two storage clusters supplying data and at least 3 > > remote file-systems
Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
Thanks Brian, May i know, if you could explain a bit more on the metadata updates issue? I am not sure i exactly understand on why the metadata updates would fail between filesystems/between clusters - since every remote cluster will have its own metadata pool/servers. I suppose the metadata updates for respective remote filesystems should go to respective remote clusters/metadata servers and should not depend on metadata servers of other remote clusters? Please do correct me if i am wrong. As of now, our workload is to use NFS/SMB to read files and update files from different remote servers. It is not for running heavy parallel read/write workloads across different servers. Thanks, Lohit On May 3, 2018, 10:25 AM -0400, Bryan Banister, wrote: > Hi Lohit, > > Just another thought, you also have to consider that metadata updates will > have to fail between nodes in the CES cluster with those in other clusters > because nodes in separate remote clusters do not communicate directly for > metadata updates, which depends on your workload is that would be an issue. > > Cheers, > -Bryan > > From: gpfsug-discuss-boun...@spectrumscale.org > [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Mathias Dietz > Sent: Thursday, May 03, 2018 6:41 AM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts > > Note: External Email > Hi Lohit, > > >I am thinking of using a single CES protocol cluster, with remote mounts > >from 3 storage clusters. > Technically this should work fine (assuming all 3 clusters use the same > uids/guids). However this has not been tested in our Test lab. > > > >One thing to watch, be careful if your CES root is on a remote fs, as if > >that goes away, so do all CES exports. > Not only the ces root file system is a concern, the whole CES cluster will go > down if any remote file systems with NFS exports is not available. > e.g. 
if remote cluster 1 is not available, the CES cluster will unmount the > corresponding file system which will lead to a NFS failure on all CES nodes. > > > Mit freundlichen Grüßen / Kind regards > > Mathias Dietz > > Spectrum Scale Development - Release Lead Architect (4.2.x) > Spectrum Scale RAS Architect > --- > IBM Deutschland > Am Weiher 24 > 65451 Kelsterbach > Phone: +49 70342744105 > Mobile: +49-15152801035 > E-Mail: mdi...@de.ibm.com > - > IBM Deutschland Research & Development GmbH > Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk > WittkoppSitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht > Stuttgart, HRB 243294 > > > > From: vall...@cbio.mskcc.org > To: gpfsug main discussion list > Date: 01/05/2018 16:34 > Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file > system mounts > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > > > Thanks Simon. > I will make sure i am careful about the CES root and test nfs exporting more > than 2 remote file systems. > > Regards, > Lohit > > On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support) > , wrote: > You have been able to do this for some time, though I think it's only just > supported. > > We've been exporting remote mounts since CES was added. > > At some point we've had two storage clusters supplying data and at least 3 > remote file-systems exported over NFS and SMB. > > One thing to watch, be careful if your CES root is on a remote fs, as if that > goes away, so do all CES exports. We do have CES root on a remote fs and it > works, just be aware... 
> > Simon > > From: gpfsug-discuss-boun...@spectrumscale.org > [gpfsug-discuss-boun...@spectrumscale.org] on behalf of > vall...@cbio.mskcc.org [vall...@cbio.mskcc.org] > Sent: 30 April 2018 22:11 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts > > Hello All, > > I read from the below link, that it is now possible to export remote mounts > over NFS/SMB. > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm > > I am thinking of using a single CES protocol cluster, with remote mounts from > 3 storage clusters. > May i know, if i will be able to export the 3 remote mounts(from 3 storage > clusters) over NFS/SMB from a single CES protocol cluster? > > Because according to the limitations as mentioned in the below link: > > https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_limitationofprotocolonRMT.htm > > It says “You can configure one storage cluster and up to five protocol >
Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
Thanks Mathiaz, Yes i do understand the concern, that if one of the remote file systems go down abruptly - the others will go down too. However, i suppose we could bring down one of the filesystems before a planned downtime? For example, by unexporting the filesystems on NFS/SMB before the downtime? I might not want to be in a situation, where i have to bring down all the remote filesystems because of planned downtime of one of the remote clusters. Regards, Lohit On May 3, 2018, 7:41 AM -0400, Mathias Dietz, wrote: > Hi Lohit, > > >I am thinking of using a single CES protocol cluster, with remote mounts > >from 3 storage clusters. > Technically this should work fine (assuming all 3 clusters use the same > uids/guids). However this has not been tested in our Test lab. > > > >One thing to watch, be careful if your CES root is on a remote fs, as if > >that goes away, so do all CES exports. > Not only the ces root file system is a concern, the whole CES cluster will go > down if any remote file systems with NFS exports is not available. > e.g. if remote cluster 1 is not available, the CES cluster will unmount the > corresponding file system which will lead to a NFS failure on all CES nodes. > > > Mit freundlichen Grüßen / Kind regards > > Mathias Dietz > > Spectrum Scale Development - Release Lead Architect (4.2.x) > Spectrum Scale RAS Architect > --- > IBM Deutschland > Am Weiher 24 > 65451 Kelsterbach > Phone: +49 70342744105 > Mobile: +49-15152801035 > E-Mail: mdi...@de.ibm.com > - > IBM Deutschland Research & Development GmbH > Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk > WittkoppSitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht > Stuttgart, HRB 243294 > > > > From: vall...@cbio.mskcc.org > To: gpfsug main discussion list > Date: 01/05/2018 16:34 > Subject: Re: [gpfsug-discuss] Spectrum Scale CES and remote file > system mounts > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > > > Thanks Simon. 
> I will make sure i am careful about the CES root and test nfs exporting more > than 2 remote file systems. > > Regards, > Lohit > > On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support) > , wrote: > You have been able to do this for some time, though I think it's only just > supported. > > We've been exporting remote mounts since CES was added. > > At some point we've had two storage clusters supplying data and at least 3 > remote file-systems exported over NFS and SMB. > > One thing to watch, be careful if your CES root is on a remote fs, as if that > goes away, so do all CES exports. We do have CES root on a remote fs and it > works, just be aware... > > Simon > > From: gpfsug-discuss-boun...@spectrumscale.org > [gpfsug-discuss-boun...@spectrumscale.org] on behalf of > vall...@cbio.mskcc.org [vall...@cbio.mskcc.org] > Sent: 30 April 2018 22:11 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts > > Hello All, > > I read from the below link, that it is now possible to export remote mounts > over NFS/SMB. > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm > > I am thinking of using a single CES protocol cluster, with remote mounts from > 3 storage clusters. > May i know, if i will be able to export the 3 remote mounts(from 3 storage > clusters) over NFS/SMB from a single CES protocol cluster? 
> > Because according to the limitations as mentioned in the below link: > > https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_limitationofprotocolonRMT.htm > > It says “You can configure one storage cluster and up to five protocol > clusters (current limit).” > > > Regards, > Lohit > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
Re: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts
Thanks Simon. I will make sure i am careful about the CES root and test nfs exporting more than 2 remote file systems. Regards, Lohit On Apr 30, 2018, 5:57 PM -0400, Simon Thompson (IT Research Support), wrote: > You have been able to do this for some time, though I think it's only just > supported. > > We've been exporting remote mounts since CES was added. > > At some point we've had two storage clusters supplying data and at least 3 > remote file-systems exported over NFS and SMB. > > One thing to watch, be careful if your CES root is on a remote fs, as if that > goes away, so do all CES exports. We do have CES root on a remote fs and it > works, just be aware... > > Simon > > From: gpfsug-discuss-boun...@spectrumscale.org > [gpfsug-discuss-boun...@spectrumscale.org] on behalf of > vall...@cbio.mskcc.org [vall...@cbio.mskcc.org] > Sent: 30 April 2018 22:11 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Spectrum Scale CES and remote file system mounts > > Hello All, > > I read from the below link, that it is now possible to export remote mounts > over NFS/SMB. > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm > > I am thinking of using a single CES protocol cluster, with remote mounts from > 3 storage clusters. > May i know, if i will be able to export the 3 remote mounts(from 3 storage > clusters) over NFS/SMB from a single CES protocol cluster? 
> > Because according to the limitations as mentioned in the below link: > > https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_limitationofprotocolonRMT.htm > > It says “You can configure one storage cluster and up to five protocol > clusters (current limit).” > > > Regards, > Lohit > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
[gpfsug-discuss] Spectrum Scale CES and remote file system mounts
Hello All, I read from the link below that it is now possible to export remote mounts over NFS/SMB. https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_protocoloverremoteclu.htm I am thinking of using a single CES protocol cluster with remote mounts from 3 storage clusters. May I know if I will be able to export the 3 remote mounts (from the 3 storage clusters) over NFS/SMB from a single CES protocol cluster? I ask because, according to the limitations mentioned in the link below: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_limitationofprotocolonRMT.htm it says “You can configure one storage cluster and up to five protocol clusters (current limit).” Regards, Lohit
Re: [gpfsug-discuss] Singularity + GPFS
We do run Singularity + GPFS on our production HPC clusters. Most of the time things are fine, without any issues. However, I do see a significant performance loss when running some applications in Singularity containers on GPFS. As of now, the applications that have severe performance issues with Singularity on GPFS appear to be those that rely on “mmap IO” (deep-learning applications). When I run the same applications on bare metal, GPFS IO is dramatically better than when running in Singularity containers. I am yet to raise a PMR about this with IBM. I have not seen performance degradation for any other kind of IO, but I am not sure. Regards, Lohit On Apr 26, 2018, 10:35 AM -0400, Nathan Harper, wrote: > We are running on a test system at the moment, and haven't run into any > issues yet, but so far it's only been 'hello world' and running FIO. > > I'm interested to hear about experience with MPI-IO within Singularity. > > > On 26 April 2018 at 15:20, Oesterlin, Robert > > wrote: > > > Anyone (including IBM) doing any work in this area? I would appreciate > > > hearing from you. 
> > > > > > Bob Oesterlin > > > Sr Principal Storage Engineer, Nuance > > > > > > > > > ___ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > -- > Nathan Harper // IT Systems Lead > > > e: nathan.har...@cfms.org.uk t: 0117 906 1104 m: 0787 551 0891 w: > www.cfms.org.uk > CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent // > Emersons Green // Bristol // BS16 7FR > > CFMS Services Ltd is registered in England and Wales No 05742022 - a > subsidiary of CFMS Ltd > CFMS Services Ltd registered office // 43 Queens Square // Bristol // BS1 4QP > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
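As a concrete starting point for isolating an mmap-specific slowdown like the one described above, here is a minimal, hypothetical harness (not from the thread; all names are illustrative) that reads a file both through ordinary buffered read() calls and through an mmap mapping, checks that the two paths agree, and reports the timing of each. Running it inside and outside a container, against a file on GPFS, is one way to see whether only the mmap path regresses.

```python
import mmap
import os
import tempfile
import time

def read_buffered(path, chunk=4096):
    """Read a file with ordinary buffered read() calls."""
    data = bytearray()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            data += block
    return bytes(data)

def read_mmapped(path, chunk=4096):
    """Read the same file through an mmap mapping, chunk by chunk."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            data = bytearray()
            for off in range(0, len(mm), chunk):
                data += mm[off:off + chunk]
            return bytes(data)

if __name__ == "__main__":
    # Scratch file for the demo; on a real cluster, point this at GPFS instead.
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        tf.write(os.urandom(1 << 20))  # 1 MiB of random data
        path = tf.name
    try:
        t0 = time.perf_counter(); a = read_buffered(path); t1 = time.perf_counter()
        b = read_mmapped(path); t2 = time.perf_counter()
        assert a == b  # both paths must observe identical bytes
        print(f"buffered: {t1 - t0:.4f}s  mmap: {t2 - t1:.4f}s")
    finally:
        os.unlink(path)
```

The absolute numbers only mean something on the target filesystem; the point is the relative gap between the two paths in each environment.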
Re: [gpfsug-discuss] GPFS, MMAP and Pagepool
Hey Sven, This is regarding mmap issues with GPFS. We had previously discussed experimenting with GPFS 5, and I have now upgraded all compute nodes and NSD nodes to GPFS 5.0.0.2. I am yet to experiment with mmap performance, but before that, I am seeing odd hangs with GPFS 5 that I think could be related to mmap. Have you ever seen GPFS hang in this call? [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] I see the above when the kernel hangs and emits a series of stack traces, and I suspect it is related to processes hanging on GPFS forever. There are no errors in GPFS, however. Also, I think the above happens only when the number of mmap threads goes above a particular value. We faced a similar issue in 4.2.3, resolved by a patch in 4.2.3.2. At that time, the issue occurred when the number of mmap threads exceeded worker1Threads; according to the ticket, it was an mmap race condition that GPFS was not handling well. I am not sure whether this issue is a repeat, and I am yet to isolate the incident and test with an increasing number of mmap threads. I am not 100 percent sure this is related to mmap yet, but I just wanted to ask whether you have seen anything like the above. Thanks, Lohit On Feb 22, 2018, 3:59 PM -0500, Sven Oehme, wrote: > Hi Lohit, > > i am working with ray on a mmap performance improvement right now, which most > likely has the same root cause as yours, see --> > http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > the thread above is silent after a couple of back and forth, but ray and i > have active communication in the background and will repost as soon as there > is something new to share. > i am happy to look at this issue after we finish with ray's workload if there > is something missing, but first let's finish his, get you to try the same fix > and see if there is something missing. > > btw. 
if people would share their use of MMAP , what applications they use > (home grown, just use lmdb which uses mmap under the cover, etc) please let > me know so i get a better picture on how wide the usage is with GPFS. i know > a lot of the ML/DL workloads are using it, but i would like to know what else > is out there i might not think about. feel free to drop me a personal note, i > might not reply to it right away, but eventually. > > thx. sven > > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > > > Hi all, > > > > > > I wanted to know, how does mmap interact with GPFS pagepool with respect > > > to filesystem block-size? > > > Does the efficiency depend on the mmap read size and the block-size of > > > the filesystem even if all the data is cached in pagepool? > > > > > > GPFS 4.2.3.2 and CentOS7. > > > > > > Here is what i observed: > > > > > > I was testing a user script that uses mmap to read from 100M to 500MB > > > files. > > > > > > The above files are stored on 3 different filesystems. > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold. > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data > > > on Near line and metadata on SSDs > > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the > > > required files fully cached" from the above GPFS cluster as home. Data > > > and Metadata together on SSDs > > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data > > > on Near line and metadata on SSDs > > > > > > When i run the script first time for “each" filesystem: > > > I see that GPFS reads from the files, and caches into the pagepool as it > > > reads, from mmdiag -- iohist > > > > > > When i run the second time, i see that there are no IO requests from the > > > compute node to GPFS NSD servers, which is expected since all the data > > > from the 3 filesystems is cached. 
> > > > > > However - the time taken for the script to run for the files in the 3 > > > different filesystems is different - although i know that they are just > > > "mmapping"/reading from pagepool/cache and not from disk. > > > > > > Here is the difference in time, for IO just from pagepool: > > > > > > 20s 4M block size > > > 15s 1M block size > > > 40S 16M block size. > > > > > > Why do i see a difference when trying to mmap reads from different > > > block-size filesystems, although i see that the IO requests are not > > > hitting disks and just the pagepool? > > > > > > I am willing to share the strace output and mmdiag outputs if needed. > > > > > > Thanks, > > > Lohit > > > > > > ___ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > ___ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss ___ gpfsug-discuss
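One user-space knob worth noting when experimenting with mmap read patterns like those above is madvise(). The sketch below (illustrative only; it does not change GPFS-internal prefetching or pagepool behavior, and requires Python 3.8+ for mmap.madvise) hints sequential access before streaming a mapping, which can reduce the cost of faulting one page at a time:

```python
import mmap
import os
import tempfile

def mmap_read_with_advice(path, chunk=128 * 1024):
    """Map a file read-only, hint sequential access, and stream it in chunks.

    madvise(MADV_SEQUENTIAL) asks the kernel to read ahead aggressively,
    which can help mmap workloads that would otherwise fault page by page.
    Returns the total number of bytes read through the mapping.
    """
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Guarded: madvise and MADV_SEQUENTIAL are POSIX-only.
            if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            for off in range(0, len(mm), chunk):
                total += len(mm[off:off + chunk])
    return total

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tf:
        tf.write(b"\0" * 300_000)
        path = tf.name
    try:
        print(mmap_read_with_advice(path))  # prints the file size, 300000
    finally:
        os.unlink(path)
```

Whether this helps on a given filesystem and block size is an empirical question; it is cheap to try alongside strace/mmdiag measurements like the ones mentioned in the thread.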
Re: [gpfsug-discuss] sublocks per block in GPFS 5.0
Thanks Mark, I did not know we could explicitly specify the sub-block size when creating a file system; it is nowhere mentioned in “man mmcrfs”. Is this a new GPFS 5.0 feature? Also, I see from “man mmcrfs” that the default sub-block size for 8M and 16M block sizes is 16K:

+-----------------------------+---------------+
| Block size                  | Subblock size |
+-----------------------------+---------------+
| 64 KiB                      | 2 KiB         |
+-----------------------------+---------------+
| 128 KiB                     | 4 KiB         |
+-----------------------------+---------------+
| 256 KiB, 512 KiB, 1 MiB, 2  | 8 KiB         |
| MiB, 4 MiB                  |               |
+-----------------------------+---------------+
| 8 MiB, 16 MiB               | 16 KiB        |
+-----------------------------+---------------+

So we could create more than 1024 sub-blocks per block, with 4K as the sub-block size for a 16M block size? That would be great, since 4K files would go into the data pool, and anything less than 4K would go to the system (metadata) pool. Do you think there would be any performance degradation from reducing the sub-block size to 4K-8K, from the default 16K, on a 16M file system? If we are not losing any blocks by choosing a bigger block size (16M) for the file system, why would we want to choose a smaller block size (4M)? What advantage would the smaller block size (4M) give over 16M for performance, since a 16M file system could store and read small files at their respective sizes too? And near-line rotating disks would be happier with a bigger block size than a smaller one, I guess? Regards, Lohit On Mar 30, 2018, 12:45 PM -0400, Marc A Kaplan, wrote: > > subblock
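To make the sub-block arithmetic in this thread concrete, here is a small sketch encoding the default block-size to sub-block-size mapping from the mmcrfs table (the function names are illustrative, not a GPFS API):

```python
KIB = 1024
MIB = 1024 * KIB

def default_subblock_size(block_size):
    """Default sub-block size for a GPFS 5.0-format file system,
    per the table quoted from `man mmcrfs`."""
    if block_size == 64 * KIB:
        return 2 * KIB
    if block_size == 128 * KIB:
        return 4 * KIB
    if 256 * KIB <= block_size <= 4 * MIB:
        return 8 * KIB
    if block_size in (8 * MIB, 16 * MIB):
        return 16 * KIB
    raise ValueError("block size not in the mmcrfs default table")

def subblocks_per_block(block_size, subblock_size=None):
    """Number of sub-blocks per full block; sub-block size may be
    overridden, as when specifying it explicitly at mmcrfs time."""
    if subblock_size is None:
        subblock_size = default_subblock_size(block_size)
    return block_size // subblock_size

print(subblocks_per_block(16 * MIB))          # -> 1024 (16 MiB / 16 KiB)
print(subblocks_per_block(4 * MIB))           # -> 512  (4 MiB / 8 KiB)
print(subblocks_per_block(16 * MIB, 4 * KIB)) # -> 4096 (explicit 4 KiB sub-block)
```

With a 16 MiB block size, the default 16 KiB sub-block gives 1024 sub-blocks per block, which matches the mmlsfs output quoted in the original question; explicitly choosing a 4 KiB sub-block would give 4096.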
[gpfsug-discuss] sublocks per block in GPFS 5.0
Hello Everyone, I am a little confused about the number of sub-blocks per block for a 16M block size in GPFS 5.0. The documentation below mentions that the number of sub-blocks per block is 16K, but "only for Spectrum Scale RAID": https://developer.ibm.com/storage/2018/01/11/spectrum-scale-variant-sub-blocks/ However, when I created the file system “without” Spectrum Scale RAID, I still see that the number of sub-blocks per block is 1024:

mmlsfs --subblocks-per-full-block
flag                         value  description
---------------------------- ------ ----------------------------------
--subblocks-per-full-block   1024   Number of subblocks per full block

So may I know whether the number of sub-blocks per block is really 16K, or am I missing something? Regards, Lohit
Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
Thanks, I will try the file heat feature, but I am really not sure it would work, since the code can access cold files too, not just recently accessed/hot files. With respect to LROC, let me explain our use case: as a first step, the code reads headers (a small region of data) from thousands of files, for example about 30,000 of them, each about 300MB to 500MB in size. After the first step, with the help of those headers, it mmaps/seeks across various regions of a set of files in parallel. Since it is all small IOs, and reading from GPFS over the network directly from disk was really slow, our idea was to use AFM, which I believe fetches all file data into flash/SSDs once the initial few blocks of the files are read. But AFM does not seem to solve the problem, so I want to know whether LROC behaves in the same way as AFM, prefetching all of the file data at full block size using all the worker threads, once a few blocks of a file have been read initially. Thanks, Lohit On Feb 22, 2018, 4:52 PM -0500, IBM Spectrum Scale, wrote: > My apologies for not being more clear on the flash storage pool. I meant > that this would be just another GPFS storage pool in the same cluster, so no > separate AFM cache cluster. You would then use the file heat feature to > ensure more frequently accessed files are migrated to that all flash storage > pool. > > As for LROC could you please clarify what you mean by a few headers/stubs of > the file? In reading the LROC documentation and the LROC variables available > in the mmchconfig command I think you might want to take a look at the > lrocDataStubFileSize variable since it seems to apply to your situation. 
> > Regards, The Spectrum Scale (GPFS) team > > -- > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479. > > If your query concerns a potential software error in Spectrum Scale (GPFS) > and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in other > countries. > > The forum is informally monitored as time permits and should not be used for > priority messages to the Spectrum Scale (GPFS) team. > > > > From: vall...@cbio.mskcc.org > To: gpfsug main discussion list > Cc: gpfsug-discuss-boun...@spectrumscale.org > Date: 02/22/2018 04:21 PM > Subject: Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage > Sent by: gpfsug-discuss-boun...@spectrumscale.org > > > > Thank you. > > I am sorry if i was not clear, but the metadata pool is all on SSDs in the > GPFS clusters that we use. Its just the data pool that is on Near-Line > Rotating disks. > I understand that AFM might not be able to solve the issue, and I will try > and see if file heat works for migrating the files to flash tier. > You mentioned an all flash storage pool for heavily used files - so you mean > a different GPFS cluster just with flash storage, and to manually copy the > files to flash storage whenever needed? > The IO performance that i am talking is prominently for reads, so you mention > that LROC can work in the way i want it to? that is prefetch all the files > into LROC cache, after only few headers/stubs of data are read from those > files? > I thought LROC only keeps that block of data that is prefetched from the > disk, and will not prefetch the whole file if a stub of data is read. > Please do let me know, if i understood it wrong. 
> > On Feb 22, 2018, 4:08 PM -0500, IBM Spectrum Scale , wrote: > I do not think AFM is intended to solve the problem you are trying to solve. > If I understand your scenario correctly you state that you are placing > metadata on NL-SAS storage. If that is true that would not be wise > especially if you are going to do many metadata operations. I suspect your > performance issues are partially due to the fact that metadata is being > stored on NL-SAS storage. You stated that you did not think the file heat > feature would do what you intended but have you tried to use it to see if it > could solve your problem? I would think having metadata on SSD/flash storage > combined with a all flash storage pool for your heavily used files would > perform well. If you expect IO usage will be such that there will be far > more reads than writes then LROC should be beneficial to your overall > performance. > > Regards, The Spectrum Scale (GPFS) team > >
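This is not how LROC is actually implemented, but the stub idea behind the lrocDataStubFileSize variable mentioned above can be sketched as a toy (class name and shape are hypothetical): keep only the first N bytes of each file in the local fast cache, and go to the backing store for everything past the stub.

```python
class StubCache:
    """Toy illustration of stub-style caching: only the first `stub_size`
    bytes of each file are kept in a local (fast) cache; reads past the
    stub fall back to the slow backing store."""

    def __init__(self, backing, stub_size):
        self.backing = backing   # dict: path -> bytes (stands in for the filesystem)
        self.stub_size = stub_size
        self.cache = {}          # path -> first stub_size bytes only

    def read(self, path, offset, length):
        """Return (data, source) where source tells which tier served it."""
        stub = self.cache.get(path)
        if stub is None:
            # Populate the cache with the stub only, never the whole file.
            stub = self.backing[path][: self.stub_size]
            self.cache[path] = stub
        end = offset + length
        if end <= len(stub):
            return stub[offset:end], "cache"
        return self.backing[path][offset:end], "backing"

if __name__ == "__main__":
    files = {"/f": bytes(range(256)) * 16}  # one 4 KiB file
    sc = StubCache(files, stub_size=1024)
    _, src = sc.read("/f", 0, 512)     # header read, within the stub
    print(src)                          # -> cache
    _, src = sc.read("/f", 2000, 100)  # deep read, past the stub
    print(src)                          # -> backing
```

The point of the toy: header reads (the first step of the workload described above) are served locally, while deep seeks still hit the backing store, so a stub-style cache helps the header pass but not the later mmap/seek pass.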
Re: [gpfsug-discuss] GPFS, MMAP and Pagepool
Thanks a lot Sven. I was trying out all the scenarios that Ray mentioned, with respect to lroc and all flash GPFS cluster and nothing seemed to be effective. As of now, we are deploying a new test cluster on GPFS 5.0 and it would be good to know the respective features that could be enabled and see if it improves anything. On the other side, i have seen various cases in my past 6 years with GPFS, where different tools do frequently use mmap. This dates back to 2013.. http://www.spectrumscale.org/pipermail/gpfsug-discuss/2013-May/000253.html when one of my colleagues asked the same question. At that time, it was a homegrown application that was using mmap, along with few other genomic pipelines. An year ago, we had issue with mmap and lot of threads where GPFS would just hang without any traces or logs, which was fixed recently. That was related to relion : https://sbgrid.org/software/titles/relion The issue that we are seeing now is ML/DL workloads, and is related to implementing external tools such as openslide (http://openslide.org/), pytorch (http://pytorch.org/) with field of application being deep learning for thousands of image patches. The IO is really slow when accessed from hard disk, and thus i was trying out other options such as LROC and flash cluster/afm cluster. But everything has a limitation as Ray mentioned. Thanks, Lohit On Feb 22, 2018, 3:59 PM -0500, Sven Oehme, wrote: > Hi Lohit, > > i am working with ray on a mmap performance improvement right now, which most > likely has the same root cause as yours , see --> > http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > the thread above is silent after a couple of back and rorth, but ray and i > have active communication in the background and will repost as soon as there > is something new to share. 
> I am happy to look at this issue after we finish with Ray's workload if something is still missing, but first let's finish his, get you to try the same fix, and see whether anything remains.
>
> Btw, if people would share their use of mmap - what applications they use (home grown, things that just use lmdb which uses mmap under the covers, etc.) - please let me know so I get a better picture of how wide the usage is with GPFS. I know a lot of the ML/DL workloads are using it, but I would like to know what else is out there that I might not think about. Feel free to drop me a personal note; I might not reply to it right away, but eventually.
>
> thx. Sven
>
> On Thu, Feb 22, 2018 at 12:33 PM wrote:
> > > Hi all,
> > >
> > > I wanted to know: how does mmap interact with the GPFS pagepool with respect to filesystem block size? Does the efficiency depend on the mmap read size and the block size of the filesystem, even if all the data is cached in the pagepool?
> > >
> > > GPFS 4.2.3.2 and CentOS 7.
> > >
> > > Here is what I observed: I was testing a user script that uses mmap to read from 100MB to 500MB files. The above files are stored on 3 different filesystems.
> > >
> > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > >
> > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on near-line and metadata on SSDs.
> > > 2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and metadata together on SSDs.
> > > 3. 16M block size GPFS filesystem, with separate metadata and data.
> > > Data on near-line and metadata on SSDs.
> > >
> > > When I run the script the first time for “each" filesystem, I see (from mmdiag --iohist) that GPFS reads from the files and caches into the pagepool as it reads.
> > >
> > > When I run it the second time, I see that there are no IO requests from the compute node to the GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached.
> > >
> > > However, the time taken for the script to run for the files in the 3 different filesystems differs, although I know that they are just "mmapping"/reading from pagepool/cache and not from disk. Here is the difference in time, for IO just from the pagepool:
> > >
> > > 20s - 4M block size
> > > 15s - 1M block size
> > > 40s - 16M block size
> > >
> > > Why do I see a difference when trying mmap reads from different block-size filesystems, although I see that the IO requests are not hitting disks, just the pagepool?
> > >
> > > I am willing to share the strace output and mmdiag outputs if needed.
> > >
> > > Thanks,
> > > Lohit
> > >
> > > ___
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
Thank you. I am sorry if I was not clear, but the metadata pool is all on SSDs in the GPFS clusters that we use. It's just the data pool that is on near-line rotating disks. I understand that AFM might not be able to solve the issue, and I will try and see if file heat works for migrating the files to a flash tier.

You mentioned an all-flash storage pool for heavily used files - do you mean a different GPFS cluster just with flash storage, with the files manually copied to flash whenever needed?

The IO performance that I am talking about is predominantly for reads. Are you saying that LROC can work the way I want it to - that is, prefetch whole files into the LROC cache after only a few headers/stubs of data are read from those files? I thought LROC only keeps the blocks of data that were fetched from disk, and will not prefetch the whole file when a stub of data is read. Please do let me know if I understood it wrong.

On Feb 22, 2018, 4:08 PM -0500, IBM Spectrum Scale, wrote:
> I do not think AFM is intended to solve the problem you are trying to solve. If I understand your scenario correctly, you state that you are placing metadata on NL-SAS storage. If that is true, that would not be wise, especially if you are going to do many metadata operations. I suspect your performance issues are partially due to the fact that metadata is being stored on NL-SAS storage. You stated that you did not think the file heat feature would do what you intended, but have you tried to use it to see if it could solve your problem? I would think having metadata on SSD/flash storage combined with an all-flash storage pool for your heavily used files would perform well. If you expect IO usage will be such that there will be far more reads than writes, then LROC should be beneficial to your overall performance.
> Regards, The Spectrum Scale (GPFS) team
>
> --
> If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479.
>
> If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract, please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries.
>
> The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.
>
> From: vall...@cbio.mskcc.org
> To: gpfsug main discussion list
> Date: 02/22/2018 03:11 PM
> Subject: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>
> Hi All,
>
> I am trying to figure out a GPFS tiering architecture with flash storage on the front end and near-line storage as the backend, for supercomputing.
>
> The backend storage will be a GPFS storage system on near-line disks of about 8-10PB. The backend storage will/can be tuned to give out large streaming bandwidth, with enough metadata disks to make the stat of all these files fast enough.
>
> I was thinking it might be possible to use a GPFS flash or SSD cluster on the front end that uses AFM and acts as a cache cluster with the backend GPFS cluster.
>
> At the end of this, the workflow that I am targeting is:
>
> “
> If the compute nodes read headers of thousands of large files ranging from 100MB to 1GB, the AFM cluster should be able to bring up enough threads to bring all of the files from the backend to the faster SSD/Flash GPFS cluster.
> The working set might be about 100T at a time, which I want to be on a faster/low-latency tier, with the rest of the files in the slower tier until they are read by the compute nodes.
> ”
>
> The reason I do not want to use GPFS policies to achieve the above is that I am not sure policies can be written in a way that moves files from the slower tier to the faster tier depending on how the jobs interact with the files. I know that policies can be written based on heat and size/format, but I don’t think these policies work in the way described above.
>
> I did try the above architecture, where an SSD GPFS cluster acts as an AFM cache cluster in front of the near-line storage. However, the AFM cluster was really, really slow; it took about a few hours to copy the files from near-line storage to the AFM cache cluster. I am not sure whether AFM is not designed to work this way, or whether AFM is simply not tuned to work as fast as it should.
>
> I have tried LROC too, but it does not behave the same way as I guess AFM works.
>
> Has anyone tried, or does anyone know, if GPFS supports an architecture where
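On the LROC question raised above: Lohit's understanding matches mine - LROC is populated with blocks as they are evicted from the pagepool, not by whole-file prefetch. What LROC keeps is tunable per node. A minimal sketch of how an LROC device is typically defined and enabled follows; the device path and node name are placeholders, and the option names are as I recall them from the Spectrum Scale documentation, so verify them against your release before relying on this.

```
# NSD stanza for a local SSD used as an LROC device (note usage=localCache)
%nsd: device=/dev/sdx
  nodes=client01
  usage=localCache

# Per-node options controlling what LROC keeps (values illustrative):
#   lrocData=yes               cache file data, not just inodes/directories
#   lrocDataStubFileSize=N     cache only the first N KB of larger files
#   lrocDataMaxFileSize=N      cache full data only for files up to N KB
#
#   mmchconfig lrocData=yes,lrocDataMaxFileSize=1024 -N client01
```

With a stub-size setting in play, only file headers would land in LROC, which is the opposite of the whole-file prefetch behavior asked about - so LROC alone would not implement the header-triggered full-file staging described in the workflow.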
[gpfsug-discuss] GPFS, MMAP and Pagepool
Hi all,

I wanted to know: how does mmap interact with the GPFS pagepool with respect to filesystem block size? Does the efficiency depend on the mmap read size and the block size of the filesystem, even if all the data is cached in the pagepool?

GPFS 4.2.3.2 and CentOS 7.

Here is what I observed: I was testing a user script that uses mmap to read from 100MB to 500MB files. The above files are stored on 3 different filesystems.

Compute nodes - 10G pagepool and 5G seqdiscardthreshold.

1. 4M block size GPFS filesystem, with separate metadata and data. Data on near-line and metadata on SSDs.
2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and metadata together on SSDs.
3. 16M block size GPFS filesystem, with separate metadata and data. Data on near-line and metadata on SSDs.

When I run the script the first time for “each" filesystem, I see (from mmdiag --iohist) that GPFS reads from the files and caches into the pagepool as it reads.

When I run it the second time, I see that there are no IO requests from the compute node to the GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached.

However, the time taken for the script to run for the files in the 3 different filesystems differs, although I know that they are just "mmapping"/reading from pagepool/cache and not from disk. Here is the difference in time, for IO just from the pagepool:

20s - 4M block size
15s - 1M block size
40s - 16M block size

Why do I see a difference when trying mmap reads from different block-size filesystems, although I see that the IO requests are not hitting disks, just the pagepool?

I am willing to share the strace output and mmdiag outputs if needed.

Thanks,
Lohit
[gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
Hi All,

I am trying to figure out a GPFS tiering architecture with flash storage on the front end and near-line storage as the backend, for supercomputing.

The backend storage will be a GPFS storage system on near-line disks of about 8-10PB. The backend storage will/can be tuned to give out large streaming bandwidth, with enough metadata disks to make the stat of all these files fast enough.

I was thinking it might be possible to use a GPFS flash or SSD cluster on the front end that uses AFM and acts as a cache cluster with the backend GPFS cluster.

At the end of this, the workflow that I am targeting is:

“
If the compute nodes read headers of thousands of large files ranging from 100MB to 1GB, the AFM cluster should be able to bring up enough threads to bring all of the files from the backend to the faster SSD/Flash GPFS cluster.
The working set might be about 100T at a time, which I want to be on a faster/low-latency tier, with the rest of the files in the slower tier until they are read by the compute nodes.
”

The reason I do not want to use GPFS policies to achieve the above is that I am not sure policies can be written in a way that moves files from the slower tier to the faster tier depending on how the jobs interact with the files. I know that policies can be written based on heat and size/format, but I don’t think these policies work in the way described above.

I did try the above architecture, where an SSD GPFS cluster acts as an AFM cache cluster in front of the near-line storage. However, the AFM cluster was really, really slow; it took about a few hours to copy the files from near-line storage to the AFM cache cluster. I am not sure whether AFM is not designed to work this way, or whether AFM is simply not tuned to work as fast as it should.

I have tried LROC too, but it does not behave the same way as I guess AFM works.
Has anyone tried, or does anyone know, whether GPFS supports an architecture where the fast tier can bring up thousands of threads and copy the files almost instantly/asynchronously from the slow tier whenever jobs on the compute nodes read a few blocks from these files?

I understand that, with respect to hardware, the AFM cluster needs to be really fast, as does the network between the AFM cluster and the backend cluster.

Please do also let me know if the above workflow can be done using GPFS policies, and be as fast as it needs to be.

Regards,
Lohit
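Since both the IBM reply and this question touch on heat-based policies, here is a rough sketch of what a file-heat-driven migration policy could look like in the GPFS policy language, applied with mmapplypolicy. The pool names 'flash' and 'nlsas' and all numeric values are hypothetical; the FILE_HEAT attribute and the fileHeat* options are as I recall them from the Spectrum Scale ILM documentation, so treat this as a starting point to check against the docs, not a tested policy.

```
/* File-heat tracking must first be enabled cluster-wide, e.g.:
     mmchconfig fileHeatPeriodMinutes=1440,fileHeatLossPercent=10
   Pool names below ('flash', 'nlsas') are placeholders. */

/* Promote files whose access heat has risen above a chosen cutoff. */
RULE 'promoteHot' MIGRATE FROM POOL 'nlsas'
     TO POOL 'flash'
     WHERE FILE_HEAT > 0.01

/* When the flash pool passes 90% full, demote files (coldest-first
   via the WEIGHT ordering) until it is back down to 70%. */
RULE 'demoteCold' MIGRATE FROM POOL 'flash'
     THRESHOLD(90,70)
     WEIGHT(0 - FILE_HEAT)
     TO POOL 'nlsas'
```

Note that this remains reactive - files are promoted only after they have accumulated heat on past runs of mmapplypolicy - so it does not give the header-triggered, near-instant staging the workflow above asks for; that is closer to what AFM prefetch attempts.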