Thanks Sven. I will disable it completely and see how it behaves. Is this the presentation? http://files.gpfsug.org/presentations/2014/UG10_GPFS_Performance_Session_v10.pdf
I guess I read it, but it did not strike me as relevant to this situation. I will try to read it again and see if I can make use of it.

Regards,
Lohit

On Sep 19, 2018, 2:12 PM -0400, Sven Oehme <[email protected]>, wrote:
> seems like you never read my performance presentation from a few years ago ;-)
>
> you can control this on a per node basis, either for all i/o:
>
> prefetchAggressiveness = X
>
> or individually for reads or writes:
>
> prefetchAggressivenessRead = X
> prefetchAggressivenessWrite = X
>
> for a start I would turn it off completely via:
>
> mmchconfig prefetchAggressiveness=0 -I -N nodename
>
> that will turn it off only for that node and only until you restart the node.
> then see what happens
>
> sven
>
> > On Wed, Sep 19, 2018 at 11:07 AM <[email protected]> wrote:
> > > Thank you Sven.
> > >
> > > I mostly think it could be 1. or some other issue.
> > > I don't think it could be 2., because I can replicate this issue no matter what the size of the dataset is. It happens for a few files that could easily fit in the pagepool too.
> > >
> > > I do see a lot more page faults for 16M compared to 1M, so it could be related to many threads competing for the same buffer space.
> > >
> > > I will try to take the trace with the trace=io option and see if I can find something.
> > >
> > > How do I turn off prefetching? Can I turn it off for a single node/client?
> > >
> > > Regards,
> > > Lohit
> > >
> > > On Sep 18, 2018, 5:23 PM -0400, Sven Oehme <[email protected]>, wrote:
> > > > Hi,
> > > >
> > > > taking a trace would tell for sure, but I suspect you might be hitting one or even multiple issues which have similar negative performance impacts but different root causes.
> > > >
> > > > 1. this could be serialization around buffer locks. the larger your blocksize gets, the larger the amount of data one of these pagepool buffers maintains. if there is a lot of concurrency on a smaller amount of data, more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size.
> > > >
> > > > 2. your data set is small'ish, let's say a couple of times bigger than the pagepool, and you access it randomly with multiple threads. what will happen is that because it doesn't fit into the cache it will be read from the backend. if multiple threads hit the same 16 MB block at once with multiple 4k random reads, it will read the whole 16 MB block because it thinks it will benefit from it later on out of cache, but because the access is fully random the same happens with the next block and the next and so on, and before you get back to this block it was pushed out of the cache for lack of enough pagepool.
> > > >
> > > > I could think of multiple other scenarios, which is why it's so hard to accurately benchmark an application: you will design a benchmark to test an application, but it almost always behaves differently than you think it does :-)
> > > >
> > > > so best is to run the real application and see under which configuration it works best.
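As a concrete way to A/B test the advice above, here is a minimal sketch of a multi-threaded random small-read probe (an illustration, not the actual application from this thread; the thread count, read size and iteration count are placeholders). Run it against a file that already sits in the pagepool, once with the default prefetchAggressiveness and once with it set to 0 on the client, and compare the wall-clock times on the 1M and 16M filesystems:

/*
 * Minimal sketch of a random small-read probe (placeholder values throughout).
 * Assumes the target file is larger than READ_SIZE and already resident in
 * the pagepool. Build: cc -O2 -o randread randread.c -lpthread
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/stat.h>

#define NUM_THREADS 8
#define READ_SIZE   4096      /* small random reads, as in the workload */
#define NUM_READS   100000    /* per thread */

static const char *path;
static off_t file_size;

static void *reader(void *arg)
{
    (void)arg;
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    char *buf = malloc(READ_SIZE);
    unsigned int seed = (unsigned int)pthread_self();

    for (long i = 0; i < NUM_READS; i++) {
        /* pick a random READ_SIZE-aligned offset inside the file */
        off_t off = ((off_t)rand_r(&seed) % (file_size / READ_SIZE)) * READ_SIZE;
        if (pread(fd, buf, READ_SIZE, off) < 0) { perror("pread"); break; }
    }
    free(buf);
    close(fd);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    path = argv[1];

    struct stat st;
    if (stat(path, &st) != 0) { perror("stat"); return 1; }
    file_size = st.st_size;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    pthread_t tid[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, reader, NULL);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%d threads x %d reads of %d bytes: %.2f s\n",
           NUM_THREADS, NUM_READS, READ_SIZE,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}

The same binary, timed with the data already cached, is also a simple way to compare how many threads it takes before the larger-blocksize filesystem starts falling behind.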
> > > > you could also take a trace with trace=io and then look at
> > > >
> > > > TRACE_VNOP: READ:
> > > > TRACE_VNOP: WRITE:
> > > >
> > > > and compare them to
> > > >
> > > > TRACE_IO: QIO: read
> > > > TRACE_IO: QIO: write
> > > >
> > > > and see if the numbers summed up for both are somewhat equal. if TRACE_VNOP is significantly smaller than TRACE_IO, you most likely do more i/o than you should, and turning prefetching off might actually make things faster.
> > > >
> > > > keep in mind I am no longer working for IBM, so all I say might be obsolete by now; I no longer have access to the one and only truth aka the source code ... but if I am wrong I am sure somebody will point this out soon ;-)
> > > >
> > > > sven
> > > >
> > > > > On Tue, Sep 18, 2018 at 10:31 AM <[email protected]> wrote:
> > > > > > Hello All,
> > > > > >
> > > > > > This is a continuation of the previous discussion that I had with Sven.
> > > > > > However, contrary to what I had mentioned previously - I realize that this is "not" related to mmap, and I see it when doing random freads.
> > > > > >
> > > > > > I see that the block-size of the filesystem matters when reading from the pagepool.
> > > > > > I see a major difference in performance when comparing 1M to 16M, when doing a lot of random small freads with all of the data in the pagepool.
> > > > > >
> > > > > > Performance for 1M is an order of magnitude better than the performance that I see for 16M.
> > > > > >
> > > > > > The GPFS that we have currently is:
> > > > > > Version: 5.0.1-0.5
> > > > > > Filesystem version: 19.01 (5.0.1.0)
> > > > > > Block-size: 16M
> > > > > >
> > > > > > I had made the filesystem block-size 16M, thinking that I would get the best performance for both random/sequential reads from 16M rather than from the smaller block-sizes.
> > > > > > With GPFS 5.0, I made use of the 1024 sub-blocks instead of 32, and thus do not lose a lot of storage space even with 16M.
> > > > > > I had run a few benchmarks and I did see that 16M was performing better "when hitting storage/disks" with respect to bandwidth for random/sequential on small/large reads.
> > > > > >
> > > > > > However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increases with block-size and actually reduces the performance.
> > > > > > 1M performs a lot better than 16M, and maybe I will get better performance with less than 1M.
> > > > > > It gives the best performance when reading from local disk, with a 4K block size filesystem.
> > > > > >
> > > > > > What I mean by performance when it comes to this workload is not the bandwidth but the amount of time that it takes to do each iteration/read of a batch of data.
> > > > > >
> > > > > > I figure what is happening is:
> > > > > > fread is trying to read a full block size of 16M - which is good, in a way, when it hits the hard disk.
> > > > > > But the application could be using just a small part of that 16M.
> > > > > > Thus, when randomly reading (freads) a lot of data in 16M chunks, it is page faulting a lot more and causing the performance to drop.
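If the theory above is right - that each small fread drags a full 16M block through the stdio layer - one low-effort check is to cap the stdio buffer and see whether the iteration time changes. This is only a sketch, resting on the assumption (worth confirming with strace -e trace=read) that glibc sizes the default FILE buffer from st_blksize, which GPFS reports as the filesystem block size; the 1M cap and the example offset are placeholders:

/*
 * Sketch (assumption, not from the thread): if glibc sizes a FILE*'s internal
 * buffer from st_blksize, a single small fread() on a 16M-blocksize
 * filesystem can pull 16M into the stdio buffer. Capping the buffer with
 * setvbuf() is one way to test whether that is what hurts here.
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    FILE *fp = fopen(argv[1], "r");
    if (!fp) { perror("fopen"); return 1; }

    /* cap the stdio buffer at 1M instead of the st_blksize-derived default */
    static char iobuf[1 << 20];
    setvbuf(fp, iobuf, _IOFBF, sizeof(iobuf));

    char chunk[4096];
    fseek(fp, 12345 * 4096L, SEEK_SET);   /* arbitrary example offset */
    size_t n = fread(chunk, 1, sizeof(chunk), fp);
    printf("read %zu bytes\n", n);

    fclose(fp);
    return 0;
}

If the capped-buffer run behaves like the 1M filesystem, that would point at stdio buffering rather than GPFS itself; if it does not, the prefetch/pagepool behaviour discussed above remains the main suspect.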
> > > > > > I could try to make the application do read() instead of freads, but I fear that could be bad too, since it might hit the disk with a very small block size, and that is not good.
> > > > > >
> > > > > > The way I see things now:
> > > > > > I believe it could be best if the application does random reads of 4k/1M from the pagepool but somehow does 16M reads from rotating disks.
> > > > > >
> > > > > > I don't see any way of doing the above other than following a different approach, where I create a filesystem with a smaller block size (1M or less than 1M) on SSDs as a tier.
> > > > > >
> > > > > > May I please ask for advice on whether what I am understanding/seeing is right, and on the best possible solution for the above scenario.
> > > > > >
> > > > > > Regards,
> > > > > > Lohit
> > > > > >
> > > > > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru <[email protected]>, wrote:
> > > > > > > Hey Sven,
> > > > > > >
> > > > > > > This is regarding mmap issues and GPFS.
> > > > > > > We had previously discussed experimenting with GPFS 5.
> > > > > > >
> > > > > > > I have now upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2.
> > > > > > >
> > > > > > > I have yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap.
> > > > > > >
> > > > > > > Have you ever seen GPFS hang on this syscall?
> > > > > > > [Tue Apr 10 04:20:13 2018] [<ffffffffa0a92155>] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
> > > > > > >
> > > > > > > I see the above when the kernel hangs and throws out a series of trace calls.
> > > > > > >
> > > > > > > I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS, however.
> > > > > > >
> > > > > > > Also, I think the above happens only when the mmap threads go above a particular number.
> > > > > > >
> > > > > > > We had faced a similar issue in 4.2.3, and it was resolved in a patch to 4.2.3.2. At that time, the issue happened when mmap threads went above worker1threads. According to the ticket, it was an mmap race condition that GPFS was not handling well.
> > > > > > >
> > > > > > > I am not sure if this issue is a repeat, and I have yet to isolate the incident and test with an increasing number of mmap threads.
> > > > > > >
> > > > > > > I am not 100 percent sure if this is related to mmap yet, but I just wanted to ask you if you have seen anything like the above.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Lohit
> > > > > > >
> > > > > > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme <[email protected]>, wrote:
> > > > > > > > Hi Lohit,
> > > > > > > >
> > > > > > > > I am working with Ray on an mmap performance improvement right now, which most likely has the same root cause as yours, see --> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> > > > > > > > the thread above is silent after a couple of back-and-forth exchanges, but Ray and I have active communication in the background and will repost as soon as there is something new to share.
> > > > > > > > I am happy to look at this issue after we finish with Ray's workload if there is still something missing, but first let's finish his, get you to try the same fix, and see if anything is missing.
> > > > > > > >
> > > > > > > > btw, if people would share their use of mmap - what applications they use (home grown, something that just uses lmdb which uses mmap under the covers, etc.) - please let me know so I get a better picture of how wide the usage is with GPFS. I know a lot of the ML/DL workloads are using it, but I would like to know what else is out there that I might not think about. feel free to drop me a personal note; I might not reply to it right away, but eventually.
> > > > > > > >
> > > > > > > > thx. sven
> > > > > > > >
> > > > > > > > > On Thu, Feb 22, 2018 at 12:33 PM <[email protected]> wrote:
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I wanted to know how mmap interacts with the GPFS pagepool with respect to filesystem block-size.
> > > > > > > > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem, even if all the data is cached in the pagepool?
> > > > > > > > > >
> > > > > > > > > > GPFS 4.2.3.2 and CentOS 7.
> > > > > > > > > >
> > > > > > > > > > Here is what I observed:
> > > > > > > > > >
> > > > > > > > > > I was testing a user script that uses mmap to read from 100MB to 500MB files.
> > > > > > > > > >
> > > > > > > > > > The above files are stored on 3 different filesystems.
> > > > > > > > > >
> > > > > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > > > > > > > > >
> > > > > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on near-line and metadata on SSDs
> > > > > > > > > > 2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and metadata together on SSDs
> > > > > > > > > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data on near-line and metadata on SSDs
> > > > > > > > > >
> > > > > > > > > > When I run the script the first time for "each" filesystem:
> > > > > > > > > > I see that GPFS reads from the files and caches into the pagepool as it reads, per mmdiag --iohist.
> > > > > > > > > >
> > > > > > > > > > When I run it the second time, I see that there are no IO requests from the compute node to the GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached.
> > > > > > > > > >
> > > > > > > > > > However, the time taken for the script to run for the files in the 3 different filesystems is different - although I know that they are just "mmapping"/reading from pagepool/cache and not from disk.
> > > > > > > > > >
> > > > > > > > > > Here is the difference in time, for IO just from the pagepool:
> > > > > > > > > >
> > > > > > > > > > 20s 4M block size
> > > > > > > > > > 15s 1M block size
> > > > > > > > > > 40s 16M block size.
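For reference, a minimal sketch of the kind of mmap read timer that would reproduce the comparison above (an illustration, not the user's actual script; the 4K stride and the random touch order are placeholders). With the data already cached, the measured time is essentially page-fault plus pagepool copy cost:

/*
 * Sketch of an mmap read timer: maps a file read-only and touches one byte
 * per 4K page in a shuffled order, so nothing dominates but fault handling
 * when the file is already resident in the pagepool.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    size_t len = (size_t)st.st_size;
    size_t pages = len / 4096;
    if (pages == 0) { fprintf(stderr, "file too small\n"); return 1; }

    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* build a shuffled page order so the access pattern is random */
    size_t *order = malloc(pages * sizeof(*order));
    for (size_t i = 0; i < pages; i++) order[i] = i;
    srand(1);
    for (size_t i = pages - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    unsigned long sum = 0;
    for (size_t i = 0; i < pages; i++)
        sum += p[order[i] * 4096];            /* one touch per 4K page */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("touched %zu pages in %.2f s (checksum %lu)\n", pages,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9, sum);

    munmap(p, len);
    free(order);
    close(fd);
    return 0;
}

Running this against the same cached file on each of the three filesystems should show whether the 20s/15s/40s spread reported above is visible even with a pattern this simple.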
> > > > > > > > > > Why do I see a difference when doing mmap reads from different block-size filesystems, although I see that the IO requests are not hitting the disks, just the pagepool?
> > > > > > > > > >
> > > > > > > > > > I am willing to share the strace output and mmdiag outputs if needed.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Lohit
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
