Hello All,

This is a continuation of the previous discussion that I had with Sven.
However, contrary to what I had mentioned previously, I realize that this is
“not” related to mmap; I see it when doing random freads.

I see that the block size of the filesystem matters even when reading from the
pagepool.
I see a major difference in performance between 1M and 16M when doing a lot of
small random freads with all of the data in the pagepool.

Performance with 1M is an order of magnitude better than the performance I see
with 16M.

Our current GPFS setup is:
Version : 5.0.1-0.5
Filesystem version: 19.01 (5.0.1.0)
Block-size : 16M

I had made the filesystem block size 16M, thinking that I would get the best
performance for both random and sequential reads with 16M rather than with
smaller block sizes.
With GPFS 5.0, I made use of 1024 sub-blocks per block instead of 32, and thus
do not lose a lot of storage space even with 16M.
I had run a few benchmarks and did see that 16M performed better “when hitting
the storage/disks”, with respect to bandwidth, for random/sequential
small/large reads.

However, with this particular workload, where it freads chunks of data
randomly from hundreds of files, I see that the number of page faults
increases with block size and actually reduces performance.
1M performs a lot better than 16M, and maybe I would get even better
performance with less than 1M.
It gives the best performance when reading from a local disk with a 4K
block-size filesystem.

What I mean by performance for this workload is not the bandwidth, but the
amount of time it takes to do each iteration/read batch of data.
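
To make that concrete, below is a rough sketch of how I could measure the
per-batch time and the page faults it generates with getrusage(). The path,
the 1M read size, and the 100-read batch are only placeholders standing in for
the real workload, not the actual application.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Sketch only: time one "read batch" of random 1M freads and report the
   minor/major page faults it caused. The file path, chunk size and batch
   length are placeholders; the file is assumed to be a few hundred MB. */
int main(void)
{
    const size_t chunk = 1 << 20;                     /* 1M per fread */
    char *buf = malloc(chunk);
    FILE *fp = fopen("/gpfs/fs16m/testfile", "rb");   /* placeholder path */
    if (!fp || !buf)
        return 1;

    struct rusage r0, r1;
    struct timespec t0, t1;
    getrusage(RUSAGE_SELF, &r0);
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < 100; i++) {                   /* one iteration/batch */
        fseek(fp, (long)(rand() % 100) * (long)chunk, SEEK_SET);
        fread(buf, 1, chunk, fp);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    getrusage(RUSAGE_SELF, &r1);

    printf("batch: %.3f s, minflt %ld, majflt %ld\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
           r1.ru_minflt - r0.ru_minflt,
           r1.ru_majflt - r0.ru_majflt);

    fclose(fp);
    free(buf);
    return 0;
}

Running the same binary against the 1M and 16M filesystems, with the data
already in the pagepool, would show whether the fault counts really track the
block size.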

What I figure is happening:
fread tries to read a full block of 16M, which is good in a way when it hits
the hard disk.
But the application may be using only a small part of that 16M. Thus, when
randomly reading (freads) a lot of data in 16M chunks, it page faults a lot
more and the performance drops.
I could try to make the application do read() instead of fread(), but I fear
that could be bad too, since it might then hit the disk with a very small
request size, and that is not good either.
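
One experiment I could try before switching to raw read() is to keep fread()
but cap the stdio buffer with setvbuf(). As far as I understand, glibc
normally sizes the default buffer from the st_blksize the filesystem reports,
which for GPFS would follow the block size; I have not verified that for our
setup, so treat it as an assumption. A minimal sketch, with a placeholder path
and an arbitrarily chosen 64K buffer:

#include <stdio.h>

/* Sketch only: force a small, fixed stdio buffer so fread() requests are
   decoupled from the block size the filesystem reports. The path and the 64K
   size are placeholders; assumes glibc sizes the default buffer from
   st_blksize. */
int main(void)
{
    FILE *fp = fopen("/gpfs/fs16m/testfile", "rb");    /* placeholder path */
    if (!fp)
        return 1;

    static char iobuf[64 * 1024];
    /* setvbuf() has to be called before the first read on the stream */
    setvbuf(fp, iobuf, _IOFBF, sizeof(iobuf));

    char data[4096];
    fseek(fp, 1048576L, SEEK_SET);                     /* arbitrary offset */
    size_t n = fread(data, 1, sizeof(data), fp);
    printf("read %zu bytes\n", n);

    fclose(fp);
    return 0;
}

If that helps, it would at least separate the application's effective read
size from the filesystem block size without giving up buffered I/O; whether it
changes what GPFS prefetches on a cold read is something I would still have to
test.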

The way I see things now, I believe it would be best if the application did
random reads of 4K/1M from the pagepool, but somehow still did 16M reads from
the rotating disks.

I don’t see any way of doing that other than following a different approach,
where I create a filesystem with a smaller block size (1M or less) on SSDs as
a tier.

May I please ask for advice on whether what I am understanding/seeing is
right, and on the best possible solution for the above scenario?

Regards,
Lohit

On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru <vall...@cbio.mskcc.org>, wrote:
> Hey Sven,
>
> This is regarding mmap issues and GPFS.
> We had discussed previously of experimenting with GPFS 5.
>
> I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2
>
> I am yet to experiment with mmap performance, but before that - I am seeing 
> weird hangs with GPFS 5 and I think it could be related to mmap.
>
> Have you seen GPFS ever hang on this syscall?
> [Tue Apr 10 04:20:13 2018] [<ffffffffa0a92155>] 
> _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
>
> I see the above ,when kernel hangs and throws out a series of trace calls.
>
> I somehow think the above trace is related to processes hanging on GPFS 
> forever. There are no errors in GPFS however.
>
> Also, I think the above happens only when the mmap threads go above a 
> particular number.
>
> We had faced a similar issue in 4.2.3 and it was resolved in a patch to 
> 4.2.3.2 . At that time , the issue happened when mmap threads go more than 
> worker1threads. According to the ticket - it was a mmap race condition that 
> GPFS was not handling well.
>
> I am not sure if this issue is a repeat and I am yet to isolate the incident 
> and test with increasing number of mmap threads.
>
> I am not 100 percent sure if this is related to mmap yet but just wanted to 
> ask you if you have seen anything like above.
>
> Thanks,
>
> Lohit
>
> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme <oeh...@gmail.com>, wrote:
> > Hi Lohit,
> >
> > i am working with ray on a mmap performance improvement right now, which 
> > most likely has the same root cause as yours , see -->  
> > http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> > the thread above is silent after a couple of back and forth, but ray and i 
> > have active communication in the background and will repost as soon as 
> > there is something new to share.
> > i am happy to look at this issue after we finish with ray's workload if 
> > there is something missing, but first let's finish his, get you try the 
> > same fix and see if there is something missing.
> >
> > btw. if people would share their use of MMAP , what applications they use 
> > (home grown, just use lmdb which uses mmap under the cover, etc) please let 
> > me know so i get a better picture on how wide the usage is with GPFS. i 
> > know a lot of the ML/DL workloads are using it, but i would like to know 
> > what else is out there i might not think about. feel free to drop me a 
> > personal note, i might not reply to it right away, but eventually.
> >
> > thx. sven
> >
> >
> > > On Thu, Feb 22, 2018 at 12:33 PM <vall...@cbio.mskcc.org> wrote:
> > > > Hi all,
> > > >
> > > > I wanted to know, how does mmap interact with GPFS pagepool with 
> > > > respect to filesystem block-size?
> > > > Does the efficiency depend on the mmap read size and the block-size of 
> > > > the filesystem even if all the data is cached in pagepool?
> > > >
> > > > GPFS 4.2.3.2 and CentOS7.
> > > >
> > > > Here is what i observed:
> > > >
> > > > I was testing a user script that uses mmap to read from 100M to 500MB 
> > > > files.
> > > >
> > > > The above files are stored on 3 different filesystems.
> > > >
> > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > > >
> > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data 
> > > > on Near line and metadata on SSDs
> > > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the 
> > > > required files fully cached" from the above GPFS cluster as home. Data 
> > > > and Metadata together on SSDs
> > > > 3. 16M block size GPFS filesystem, with separate metadata and data. 
> > > > Data on Near line and metadata on SSDs
> > > >
> > > > When i run the script first time for “each" filesystem:
> > > > I see that GPFS reads from the files, and caches into the pagepool as 
> > > > it reads, from mmdiag -- iohist
> > > >
> > > > When i run the second time, i see that there are no IO requests from 
> > > > the compute node to GPFS NSD servers, which is expected since all the 
> > > > data from the 3 filesystems is cached.
> > > >
> > > > However - the time taken for the script to run for the files in the 3 
> > > > different filesystems is different - although i know that they are just 
> > > > "mmapping"/reading from pagepool/cache and not from disk.
> > > >
> > > > Here is the difference in time, for IO just from pagepool:
> > > >
> > > > 20s 4M block size
> > > > 15s 1M block size
> > > > 40S 16M block size.
> > > >
> > > > Why do i see a difference when trying to mmap reads from different 
> > > > block-size filesystems, although i see that the IO requests are not 
> > > > hitting disks and just the pagepool?
> > > >
> > > > I am willing to share the strace output and mmdiag outputs if needed.
> > > >
> > > > Thanks,
> > > > Lohit
> > > >
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
