Hey Sven,

This is regarding mmap issues and GPFS.
We had previously discussed experimenting with GPFS 5.

I have now upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2.

I have yet to experiment with mmap performance, but before that - I am seeing 
weird hangs with GPFS 5, and I think they could be related to mmap.

Have you ever seen GPFS hang in this kernel call?
[Tue Apr 10 04:20:13 2018] [<ffffffffa0a92155>] 
_ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]

I see the above when the kernel hangs and throws out a series of stack traces.

I suspect the above trace is related to processes hanging on GPFS forever. 
There are no errors from GPFS, however.
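
For what it's worth, a rough sketch along the lines below (plain /proc reads, 
nothing GPFS-specific; it assumes root access and a kernel that exposes 
/proc/<pid>/stack) should be enough to list processes stuck in uninterruptible 
sleep with mmapLock on their kernel stack, without waiting for the hung-task 
watchdog to fire:

import os
import re

# Sketch: list processes in uninterruptible sleep (D state) whose kernel
# stack mentions the GPFS mmap lock, as in the hung-task trace above.

def kernel_stack(pid):
    # /proc/<pid>/stack shows the kernel-side call chain of a blocked task.
    try:
        with open("/proc/%s/stack" % pid) as f:
            return f.read()
    except (IOError, OSError):
        return ""

def main():
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/status" % pid) as f:
                status = f.read()
        except (IOError, OSError):
            continue  # process exited while we were scanning
        state = re.search(r"^State:\s+(\S+)", status, re.M)
        if not state or state.group(1) != "D":
            continue  # only interested in uninterruptible sleep
        stack = kernel_stack(pid)
        # Flag tasks whose kernel stack mentions the mmapLock symbol or the
        # mmfs kernel modules.
        if "mmapLock" in stack or "mmfs" in stack:
            name = re.search(r"^Name:\s+(\S+)", status, re.M)
            print("--- pid %s (%s) ---" % (pid, name.group(1) if name else "?"))
            print(stack)

if __name__ == "__main__":
    main()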

Also, I think the above happens only when the number of mmap threads goes 
above a particular value.

We faced a similar issue on 4.2.3, and it was resolved by a patch in 4.2.3.2. 
At that time, the issue happened when the number of mmap threads exceeded 
worker1Threads. According to the ticket, it was an mmap race condition that 
GPFS was not handling well.

I am not sure if this issue is a repeat of that one, and I have yet to isolate 
the incident and test with an increasing number of mmap threads.
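
When I do, the test will probably be something along the lines of the rough 
sketch below: start N mmap reader threads against the same set of files and 
watch whether readers start to block once N goes past worker1Threads. The 
thread count, file paths and 600s timeout are placeholders, and this only 
mimics the read pattern, not the actual user application.

import mmap
import sys
import threading

def mmap_read(path, chunk=4 * 1024 * 1024):
    # Touch the whole file through mmap in chunk-sized strides.
    f = open(path, "rb")
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    try:
        total = 0
        for off in range(0, len(m), chunk):
            total += len(m[off:off + chunk])
        return total
    finally:
        m.close()
        f.close()

def run(paths, nthreads):
    threads = []
    for i in range(nthreads):
        t = threading.Thread(target=mmap_read,
                             args=(paths[i % len(paths)],),
                             name="mmap-%d" % i)
        t.daemon = True
        t.start()
        threads.append(t)
    for t in threads:
        t.join(600)  # generous timeout; survivors are likely hung
        if t.is_alive():
            sys.stderr.write("%s still blocked after 600s\n" % t.name)

if __name__ == "__main__":
    # Usage: mmap_hang_test.py NTHREADS file1 [file2 ...]
    run(sys.argv[2:], int(sys.argv[1]))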

I am not 100 percent sure this is related to mmap yet, but I wanted to ask 
whether you have seen anything like the above.

Thanks,

Lohit

On Feb 22, 2018, 3:59 PM -0500, Sven Oehme <oeh...@gmail.com>, wrote:
> Hi Lohit,
>
> i am working with ray on an mmap performance improvement right now, which 
> most likely has the same root cause as yours, see --> 
> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> the thread above has gone silent after a couple of rounds of back and forth, 
> but ray and i have active communication in the background and will repost as 
> soon as there is something new to share.
> i am happy to look at this issue after we finish with ray's workload if 
> something is still missing, but first let's finish his, get you to try the 
> same fix, and see if anything is still missing.
>
> btw. if people would share their use of MMAP - what applications they use 
> (home grown, just using lmdb which uses mmap under the covers, etc.) - please 
> let me know so i get a better picture of how wide the usage is with GPFS. i 
> know a lot of the ML/DL workloads are using it, but i would like to know what 
> else is out there that i might not be thinking about. feel free to drop me a 
> personal note; i might not reply to it right away, but i will eventually.
>
> thx. sven
>
>
> > On Thu, Feb 22, 2018 at 12:33 PM <vall...@cbio.mskcc.org> wrote:
> > > Hi all,
> > >
> > > I wanted to know how mmap interacts with the GPFS pagepool with respect 
> > > to filesystem block size. Does the efficiency depend on the mmap read 
> > > size and the block size of the filesystem even if all the data is cached 
> > > in the pagepool?
> > >
> > > GPFS 4.2.3.2 and CentOS7.
> > >
> > > Here is what I observed:
> > >
> > > I was testing a user script that uses mmap to read files ranging from 
> > > 100 MB to 500 MB in size.
> > >
> > > The above files are stored on 3 different filesystems.
> > >
> > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > >
> > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > > 2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the 
> > > required files fully cached" from the above GPFS cluster as home. Data 
> > > and metadata together on SSDs
> > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > >
> > > When I run the script the first time for each filesystem, I see from 
> > > "mmdiag --iohist" that GPFS reads from the files and caches them into 
> > > the pagepool as it reads.
> > >
> > > When I run it the second time, I see that there are no IO requests from 
> > > the compute node to the GPFS NSD servers, which is expected since all 
> > > the data from the 3 filesystems is cached.
> > >
> > > However, the time taken for the script to run against the files on the 3 
> > > different filesystems differs, although I know that they are just 
> > > "mmapping"/reading from pagepool/cache and not from disk.
> > >
> > > Here is the difference in time, for IO just from pagepool:
> > >
> > > 20s 4M block size
> > > 15s 1M block size
> > > 40s 16M block size.
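> > >
> > > (The read pattern being timed is roughly the stripped-down sketch below; 
> > > this is not the actual user script, and the file list and 4 MB stride 
> > > are placeholders.)
> > >
> > > import mmap, sys, time
> > >
> > > def timed_mmap_read(path, stride=4 * 1024 * 1024):
> > >     # Map the file read-only and walk it in stride-sized slices, so all
> > >     # reads go through the mmap path rather than read().
> > >     f = open(path, "rb")
> > >     m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
> > >     start = time.time()
> > >     total = 0
> > >     for off in range(0, len(m), stride):
> > >         total += len(m[off:off + stride])
> > >     elapsed = time.time() - start
> > >     m.close()
> > >     f.close()
> > >     print("%s: %d bytes in %.2fs" % (path, total, elapsed))
> > >
> > > if __name__ == "__main__":
> > >     for p in sys.argv[1:]:
> > >         timed_mmap_read(p)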
> > >
> > > Why do I see a difference when doing mmap reads from filesystems with 
> > > different block sizes, even though the IO requests are not hitting the 
> > > disks, only the pagepool?
> > >
> > > I am willing to share the strace output and mmdiag outputs if needed.
> > >
> > > Thanks,
> > > Lohit
> > >
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
