Hey Sven,

This is regarding the mmap issues with GPFS. We had previously discussed experimenting with GPFS 5.
I have now upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2. I have yet to experiment with mmap performance, but before that I am seeing weird hangs with GPFS 5 that I think could be related to mmap.

Have you ever seen GPFS hang on this call?

[Tue Apr 10 04:20:13 2018] [<ffffffffa0a92155>] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]

I see the above when the kernel hangs and throws out a series of trace calls. I suspect this trace is related to processes hanging on GPFS forever; GPFS itself reports no errors, however. I also think it happens only when the number of mmap threads goes above a particular value.

We faced a similar issue on 4.2.3, and it was resolved by a patch in 4.2.3.2. At that time the issue appeared when the number of mmap threads exceeded worker1threads; according to the ticket, it was an mmap race condition that GPFS was not handling well. I am not sure whether this issue is a repeat, and I have yet to isolate the incident and test with an increasing number of mmap threads. I am not 100 percent sure this is related to mmap yet, but I wanted to ask whether you have seen anything like the above.
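In case it helps with isolating this, below is a minimal sketch (hypothetical, not the actual application) of the kind of multi-threaded mmap read load I have in mind: N threads each mmap the same file and touch every page, so the thread count can be raised past worker1threads while watching for the hang. File path and thread count are arguments I made up for illustration.

/*
 * Minimal sketch of a multi-threaded mmap read load (hypothetical reproducer,
 * not the actual application).  Each thread mmaps the same file and touches
 * every page sequentially.
 * Usage: ./mmap_load <file> <nthreads>
 * Build: cc -O2 -pthread mmap_load.c -o mmap_load
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const char *path;
static size_t fsize;

static void *reader(void *arg)
{
    (void)arg;
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    char *map = mmap(NULL, fsize, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return NULL; }

    /* Touch one byte per page so every page is faulted in through mmap. */
    long pagesz = sysconf(_SC_PAGESIZE);
    volatile unsigned long sum = 0;
    for (size_t off = 0; off < fsize; off += (size_t)pagesz)
        sum += (unsigned char)map[off];

    munmap(map, fsize);
    close(fd);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <nthreads>\n", argv[0]);
        return 1;
    }
    path = argv[1];
    int nthreads = atoi(argv[2]);

    struct stat st;
    if (stat(path, &st) != 0) { perror("stat"); return 1; }
    fsize = (size_t)st.st_size;

    pthread_t *tids = calloc((size_t)nthreads, sizeof(*tids));
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, reader, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    return 0;
}

The idea would be to run this with <nthreads> stepped above the node's worker1threads value and watch for the mmapLock trace above.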
Thanks,
Lohit

On Feb 22, 2018, 3:59 PM -0500, Sven Oehme <[email protected]> wrote:
> Hi Lohit,
>
> i am working with ray on a mmap performance improvement right now, which most
> likely has the same root cause as yours, see -->
> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> the thread above is silent after a couple of back and forth, but ray and i
> have active communication in the background and will repost as soon as there
> is something new to share.
> i am happy to look at this issue after we finish with ray's workload if there
> is something missing, but first let's finish his, get you to try the same fix
> and see if there is something missing.
>
> btw. if people would share their use of MMAP, what applications they use
> (home grown, just use lmdb which uses mmap under the cover, etc) please let
> me know so i get a better picture on how wide the usage is with GPFS. i know
> a lot of the ML/DL workloads are using it, but i would like to know what else
> is out there i might not think about. feel free to drop me a personal note, i
> might not reply to it right away, but eventually.
>
> thx. sven
>
> On Thu, Feb 22, 2018 at 12:33 PM <[email protected]> wrote:
> > > Hi all,
> > >
> > > I wanted to know, how does mmap interact with the GPFS pagepool with
> > > respect to filesystem block size?
> > > Does the efficiency depend on the mmap read size and the block size of
> > > the filesystem even if all the data is cached in the pagepool?
> > >
> > > GPFS 4.2.3.2 and CentOS 7.
> > >
> > > Here is what i observed:
> > >
> > > I was testing a user script that uses mmap to read from 100MB to 500MB
> > > files.
> > >
> > > The above files are stored on 3 different filesystems.
> > >
> > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > >
> > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data
> > > on near-line and metadata on SSDs.
> > > 2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the
> > > required files fully cached" from the above GPFS cluster as home. Data
> > > and metadata together on SSDs.
> > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data
> > > on near-line and metadata on SSDs.
> > >
> > > When i run the script the first time for "each" filesystem:
> > > I see that GPFS reads from the files and caches into the pagepool as it
> > > reads, from mmdiag --iohist.
> > >
> > > When i run it the second time, i see that there are no IO requests from
> > > the compute node to the GPFS NSD servers, which is expected since all the
> > > data from the 3 filesystems is cached.
> > >
> > > However, the time taken for the script to run for the files in the 3
> > > different filesystems is different - although i know that they are just
> > > "mmapping"/reading from pagepool/cache and not from disk.
> > >
> > > Here is the difference in time, for IO just from the pagepool:
> > >
> > > 20s - 4M block size
> > > 15s - 1M block size
> > > 40s - 16M block size
> > >
> > > Why do i see a difference when trying mmap reads from different
> > > block-size filesystems, although i see that the IO requests are not
> > > hitting disks and just the pagepool?
> > >
> > > I am willing to share the strace output and mmdiag outputs if needed.
> > >
> > > Thanks,
> > > Lohit
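For reference, here is a minimal sketch (hypothetical; the actual user script was not shared) of the kind of timed mmap read pass described in the quoted message: mmap a file, copy it out of the mapping in fixed-size chunks, and report the elapsed time, so the same binary can be run against the 1M, 4M, and 16M block-size filesystems with different chunk sizes. The file path and chunk size are illustrative parameters, not values from the original test.

/*
 * Minimal sketch of a timed mmap read pass (hypothetical; not the actual
 * user script).  mmap a file, read it out of the mapping in fixed-size
 * chunks, and print the elapsed time.
 * Usage: ./mmap_time <file> <chunk_bytes>
 * Build: cc -O2 mmap_time.c -o mmap_time
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <chunk_bytes>\n", argv[0]);
        return 1;
    }
    size_t chunk = (size_t)strtoull(argv[2], NULL, 0);
    if (chunk == 0) { fprintf(stderr, "chunk_bytes must be > 0\n"); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    size_t fsize = (size_t)st.st_size;

    char *map = mmap(NULL, fsize, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    char *buf = malloc(chunk);
    if (!buf) { perror("malloc"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Read the whole mapping in chunk-sized pieces, like an application
     * doing fixed-size reads over an mmapped file. */
    for (size_t off = 0; off < fsize; off += chunk) {
        size_t n = (fsize - off < chunk) ? fsize - off : chunk;
        memcpy(buf, map + off, n);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (double)(t1.tv_sec - t0.tv_sec) +
                  (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%s: %zu bytes in %.3f s (%.1f MB/s)\n",
           argv[1], fsize, secs, fsize / secs / 1e6);

    free(buf);
    munmap(map, fsize);
    close(fd);
    return 0;
}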
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
