Re: [gpfsug-discuss] GPFS, MMAP and Pagepool

2018-04-11 Thread Lohit Valleru
Hey Sven,

This is regarding mmap issues and GPFS.
We had previously discussed experimenting with GPFS 5.

I have now upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2.

I have yet to experiment with mmap performance, but before that I am seeing odd hangs with GPFS 5 that I think could be related to mmap.

Have you ever seen GPFS hang in this code path?
[Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]

I see the above when the kernel hangs and prints a series of stack traces.

I suspect the above trace is related to processes hanging on GPFS indefinitely; GPFS itself reports no errors, however.
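
To confirm which processes are stuck and whether they are blocked in the GPFS mmap lock path, something like the sketch below could be used (assuming Linux /proc and root access for /proc/<pid>/stack; this is only an illustration, not something we currently run):

    # Minimal sketch (assumes Linux /proc and root access for /proc/<pid>/stack):
    # list processes stuck in uninterruptible sleep (D state) and dump their
    # kernel stacks, to check whether they are blocked in the GPFS mmap lock
    # path (frames mentioning "mmapLock" or "mmfs26").
    import glob

    for stat_path in glob.glob('/proc/[0-9]*/stat'):
        pid = stat_path.split('/')[2]
        try:
            with open(stat_path) as f:
                stat = f.read()
            # the state letter is the first field after the ')' closing the comm field
            state = stat.rsplit(')', 1)[1].split()[0]
            if state != 'D':
                continue
            with open('/proc/%s/comm' % pid) as f:
                comm = f.read().strip()
            with open('/proc/%s/stack' % pid) as f:
                stack = f.read()
            print('PID %s (%s):' % (pid, comm))
            print(stack)
        except (IOError, OSError):
            continue  # process exited or access denied; skip it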

Also, I think this happens only when the number of mmap threads goes above a particular value.

We faced a similar issue in 4.2.3, and it was resolved in a patch to 4.2.3.2. At that time the issue occurred when the number of mmap threads exceeded worker1Threads. According to the ticket, it was an mmap race condition that GPFS was not handling well.

I am not sure whether this issue is a repeat, and I have yet to isolate the incident and test with an increasing number of mmap threads.
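
When I get to that test, the idea would be something along the lines of the sketch below (a hypothetical reproducer, not our real workload): it starts a configurable number of threads that each mmap a file on GPFS and touch every page. Note that CPython's GIL partly serialises the faulting threads, so this only approximates a heavily multi-threaded mmap application.

    # Hypothetical reproducer sketch (not our actual workload): start N threads
    # that each mmap the given file read-only and touch one byte per 4 KB page,
    # to see whether hangs appear once the thread count crosses some threshold.
    import mmap
    import sys
    import threading

    def mmap_reader(path):
        with open(path, 'rb') as f:
            mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
            try:
                for off in range(0, mm.size(), 4096):  # one touch per page
                    _ = mm[off]
            finally:
                mm.close()

    if __name__ == '__main__':
        path, nthreads = sys.argv[1], int(sys.argv[2])
        threads = [threading.Thread(target=mmap_reader, args=(path,))
                   for _ in range(nthreads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

Running it with an increasing thread count, stepping past worker1Threads, should show whether the hang is tied to the number of threads.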

I am not 100 percent sure yet that this is related to mmap, but I wanted to ask whether you have seen anything like the above.

Thanks,

Lohit

On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote:
> Hi Lohit,
>
> i am working with ray on a mmap performance improvement right now, which most 
> likely has the same root cause as yours , see -->  
> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> the thread above is silent after a couple of back and rorth, but ray and i 
> have active communication in the background and will repost as soon as there 
> is something new to share.
> i am happy to look at this issue after we finish with ray's workload if there 
> is something missing, but first let's finish his, get you try the same fix 
> and see if there is something missing.
>
> btw. if people would share their use of MMAP , what applications they use 
> (home grown, just use lmdb which uses mmap under the cover, etc) please let 
> me know so i get a better picture on how wide the usage is with GPFS. i know 
> a lot of the ML/DL workloads are using it, but i would like to know what else 
> is out there i might not think about. feel free to drop me a personal note, i 
> might not reply to it right away, but eventually.
>
> thx. sven
>
>
> > On Thu, Feb 22, 2018 at 12:33 PM  wrote:
> > > Hi all,
> > >
> > > I wanted to know, how does mmap interact with GPFS pagepool with respect 
> > > to filesystem block-size?
> > > Does the efficiency depend on the mmap read size and the block-size of 
> > > the filesystem even if all the data is cached in pagepool?
> > >
> > > GPFS 4.2.3.2 and CentOS7.
> > >
> > > Here is what i observed:
> > >
> > > I was testing a user script that uses mmap to read from 100M to 500MB 
> > > files.
> > >
> > > The above files are stored on 3 different filesystems.
> > >
> > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > >
> > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the 
> > > required files fully cached" from the above GPFS cluster as home. Data 
> > > and Metadata together on SSDs
> > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > >
> > > When i run the script first time for “each" filesystem:
> > > I see that GPFS reads from the files, and caches into the pagepool as it 
> > > reads, from mmdiag -- iohist
> > >
> > > When i run the second time, i see that there are no IO requests from the 
> > > compute node to GPFS NSD servers, which is expected since all the data 
> > > from the 3 filesystems is cached.
> > >
> > > However - the time taken for the script to run for the files in the 3 
> > > different filesystems is different - although i know that they are just 
> > > "mmapping"/reading from pagepool/cache and not from disk.
> > >
> > > Here is the difference in time, for IO just from pagepool:
> > >
> > > 20s 4M block size
> > > 15s 1M block size
> > > 40S 16M block size.
> > >
> > > Why do i see a difference when trying to mmap reads from different 
> > > block-size filesystems, although i see that the IO requests are not 
> > > hitting disks and just the pagepool?
> > >
> > > I am willing to share the strace output and mmdiag outputs if needed.
> > >
> > > Thanks,
> > > Lohit
> > >
> > > ___
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss 

Re: [gpfsug-discuss] GPFS, MMAP and Pagepool

2018-02-22 Thread valleru
Thanks a lot, Sven.
I was trying out all the scenarios that Ray mentioned with respect to LROC and an all-flash GPFS cluster, and nothing seemed to be effective.

As of now we are deploying a new test cluster on GPFS 5.0, and it would be good to know which features could be enabled and to see whether they improve anything.

On another note, I have seen various cases over my past 6 years with GPFS where different tools frequently use mmap. This dates back to 2013 (http://www.spectrumscale.org/pipermail/gpfsug-discuss/2013-May/000253.html), when one of my colleagues asked the same question. At that time it was a homegrown application that was using mmap, along with a few other genomics pipelines.
A year ago we had an issue with mmap and a large number of threads where GPFS would just hang without any traces or logs, which was fixed recently. That was related to RELION:
https://sbgrid.org/software/titles/relion

The issue we are seeing now is with ML/DL workloads, and is related to external tools such as OpenSlide (http://openslide.org/) and PyTorch (http://pytorch.org/), with the field of application being deep learning over thousands of image patches. The access pattern is roughly the one sketched below.
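
As an illustration only (this is not OpenSlide or PyTorch code, and PATCH_BYTES is a made-up number), the pattern boils down to many small, effectively random reads of patches out of large memory-mapped files:

    # Illustrative sketch of the access pattern only, not OpenSlide or PyTorch
    # code. PATCH_BYTES is a made-up patch size; it assumes the file is larger
    # than one patch.
    import mmap
    import random

    PATCH_BYTES = 256 * 1024

    def read_random_patches(path, npatches):
        with open(path, 'rb') as f:
            mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
            try:
                limit = mm.size() - PATCH_BYTES
                for _ in range(npatches):
                    off = random.randint(0, limit)
                    patch = mm[off:off + PATCH_BYTES]  # each slice faults pages in
            finally:
                mm.close()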

The IO is really slow when the data is accessed from hard disk, and thus I was trying out other options such as LROC and a flash cluster/AFM cache cluster. But everything has a limitation, as Ray mentioned.

Thanks,
Lohit


Re: [gpfsug-discuss] GPFS, MMAP and Pagepool

2018-02-22 Thread Sven Oehme
Hi Lohit,

I am working with Ray on an mmap performance improvement right now, which most likely has the same root cause as yours; see
http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
The thread above went silent after a couple of rounds back and forth, but Ray and I are in active communication in the background and will repost as soon as there is something new to share.
I am happy to look at your issue after we finish with Ray's workload, but first let's finish his, get you to try the same fix, and see whether anything is still missing.

By the way, if people would share how they use mmap and which applications they use (home-grown code, something like LMDB that uses mmap under the covers, etc.), please let me know so that I get a better picture of how widespread the usage is with GPFS. I know a lot of the ML/DL workloads use it, but I would like to know what else is out there that I might not be thinking of. Feel free to drop me a personal note; I might not reply right away, but I will eventually.

Thanks,
Sven


[gpfsug-discuss] GPFS, MMAP and Pagepool

2018-02-22 Thread valleru
Hi all,

I wanted to know how mmap interacts with the GPFS pagepool with respect to the filesystem block size.
Does the efficiency depend on the mmap read size and the block size of the filesystem, even if all the data is cached in the pagepool?

GPFS 4.2.3.2 and CentOS 7.

Here is what I observed:

I was testing a user script that uses mmap to read from 100 MB to 500 MB files.

The above files are stored on 3 different filesystems.

The compute nodes have a 10 GB pagepool and a 5 GB seqdiscardthreshold.

1. A 4 MB block-size GPFS filesystem with separate metadata and data: data on nearline disks and metadata on SSDs.
2. A 1 MB block-size GPFS filesystem acting as an AFM cache cluster, with all the required files fully cached from the above GPFS cluster as home: data and metadata together on SSDs.
3. A 16 MB block-size GPFS filesystem with separate metadata and data: data on nearline disks and metadata on SSDs.

When I run the script the first time on each filesystem, I can see from mmdiag --iohist that GPFS reads the files and caches them into the pagepool as it goes.

When I run it a second time, I see no IO requests from the compute node to the GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached.

However, the time taken for the script to run differs across the 3 filesystems, even though I know the reads are being served from the pagepool/cache via mmap and not from disk.

Here is the difference in time for IO served purely from the pagepool:

20 s for the 4 MB block size
15 s for the 1 MB block size
40 s for the 16 MB block size

Why do I see a difference when doing mmap reads from filesystems with different block sizes, even though the IO requests are not hitting the disks and are served only from the pagepool?
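
For reference, the measurement is essentially of the shape below (a minimal sketch, not my actual script; it assumes the files were already pulled into the pagepool by an earlier run, and reading means touching every 4 KB page of the mapping sequentially):

    # Minimal timing sketch (assumption: the file is already cached in the
    # pagepool from a previous run). Maps the file read-only, touches one byte
    # per 4 KB page sequentially, and reports the elapsed time.
    import mmap
    import sys
    import time

    def time_mmap_read(path, step=4096):
        with open(path, 'rb') as f:
            mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
            try:
                t0 = time.time()
                for off in range(0, mm.size(), step):
                    _ = mm[off]
                return time.time() - t0
            finally:
                mm.close()

    if __name__ == '__main__':
        for path in sys.argv[1:]:
            print('%s: %.2f s' % (path, time_mmap_read(path)))

Comparing the warm (second) run across one file from each of the three filesystems is the kind of comparison behind the numbers above.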

I am willing to share the strace output and mmdiag outputs if needed.

Thanks,
Lohit

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss