seems like you never read my performance presentation from a few years ago ;-)
you can control this on a per-node basis, either for all i/o:

prefetchAggressiveness = X

or individually for reads or writes:

prefetchAggressivenessRead = X
prefetchAggressivenessWrite = X

for a start i would turn it off completely via:

mmchconfig prefetchAggressiveness=0 -I -N nodename

that will turn it off only for that node, and only until you restart the node. then see what happens.

sven

On Wed, Sep 19, 2018 at 11:07 AM <vall...@cbio.mskcc.org> wrote:

> Thank you Sven.
>
> I mostly think it could be 1. or some other issue.
> I don't think it could be 2., because i can replicate this issue no matter what the size of the dataset is. It happens for a few files that could easily fit in the page pool too.
>
> I do see a lot more page faults for 16M compared to 1M, so it could be related to many threads trying to compete for the same buffer space.
>
> I will try to take the trace with the trace=io option and see if i can find something.
>
> How do i turn off prefetching? Can i turn it off for a single node/client?
>
> Regards,
> Lohit
>
> On Sep 18, 2018, 5:23 PM -0400, Sven Oehme <oeh...@gmail.com>, wrote:
>
> Hi,
>
> taking a trace would tell for sure, but i suspect you might be hitting one or even multiple issues which have similar negative performance impacts but different root causes.
>
> 1. this could be serialization around buffer locks. the larger your blocksize gets, the larger the amount of data each of these pagepool buffers maintains. if there is a lot of concurrency on a smaller amount of data, more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size.
>
> 2. your data set is small'ish, let's say a couple of times bigger than the pagepool, and you access it randomly with multiple threads. what will happen is that because it doesn't fit into the cache, it will be read from the backend. if multiple threads hit the same 16 MB block at once with multiple 4k random reads, it will read the whole 16 MB block because it thinks it will benefit from it later on out of cache. but because the access is fully random, the same happens with the next block and the next and so on, and before you get back to this block it was pushed out of the cache for lack of enough pagepool.
>
> i could think of multiple other scenarios, which is why it's so hard to accurately benchmark an application: you design a benchmark to test an application, but it almost always behaves differently than you think it does :-)
>
> so best is to run the real application and see under which configuration it works best.
>
> you could also take a trace with trace=io and then look at
>
> TRACE_VNOP: READ:
> TRACE_VNOP: WRITE:
>
> and compare them to
>
> TRACE_IO: QIO: read
> TRACE_IO: QIO: write
>
> and see if the numbers summed up for both are somewhat equal. if TRACE_VNOP is significantly smaller than TRACE_IO, you most likely do more i/o than you should, and turning prefetching off might actually make things faster.
>
> keep in mind i am no longer working for IBM, so all i say might be obsolete by now, i no longer have access to the one and only truth aka the source code ... but if i am wrong i am sure somebody will point this out soon ;-)
>
> sven
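>
> p.s. once you have the trace formatted into a readable report, a quick way to eyeball those two counts is something like the following. this is just a rough sketch - the report file name is a placeholder and the exact line prefixes may differ between releases, so treat both as assumptions:
>
> # count application-level reads/writes (TRACE_VNOP) vs. i/o actually
> # queued to the backend (TRACE_IO QIO) in a formatted trace report.
> # "trcrpt.out" is a placeholder name, adjust to your report file.
> from collections import Counter
>
> counts = Counter()
> with open("trcrpt.out") as report:
>     for line in report:
>         if "TRACE_VNOP" in line and ("READ" in line or "WRITE" in line):
>             counts["vnop"] += 1
>         elif "TRACE_IO" in line and "QIO" in line and ("read" in line or "write" in line):
>             counts["qio"] += 1
>
> print("TRACE_VNOP (read+write):   ", counts["vnop"])
> print("TRACE_IO   (QIO read/write):", counts["qio"])
> # if TRACE_VNOP comes out much smaller than TRACE_IO, you are most likely
> # doing more i/o than the application asked for, and prefetch is the
> # first thing to suspect.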
>
> On Tue, Sep 18, 2018 at 10:31 AM <vall...@cbio.mskcc.org> wrote:
>
>> Hello All,
>>
>> This is a continuation of the previous discussion that i had with Sven.
>> However, against what i had mentioned previously - i realize that this is “not” related to mmap, and i see it when doing random freads.
>>
>> I see that the block-size of the filesystem matters when reading from the page pool.
>> I see a major difference in performance when comparing 1M to 16M, when doing a lot of random small freads with all of the data in the pagepool.
>>
>> Performance for 1M is a magnitude “more” than the performance that i see for 16M.
>>
>> The GPFS that we have currently is:
>> Version: 5.0.1-0.5
>> Filesystem version: 19.01 (5.0.1.0)
>> Block-size: 16M
>>
>> I had made the filesystem block-size 16M, thinking that i would get the most performance for both random/sequential reads from 16M rather than from the smaller block-sizes.
>> With GPFS 5.0, i made use of the 1024 sub-blocks instead of 32, and thus do not lose a lot of storage space even with 16M.
>> I had run a few benchmarks and i did see that 16M was performing better “when hitting storage/disks” with respect to bandwidth for random/sequential on small/large reads.
>>
>> However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increases with block-size and actually reduces the performance.
>> 1M performs a lot better than 16M, and maybe i will get better performance with less than 1M.
>> It gives the best performance when reading from local disk, with a 4K block-size filesystem.
>>
>> What i mean by performance when it comes to this workload - is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data.
>>
>> I figure what is happening is:
>> fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk.
>> But the application could be using just a small part of that 16M. Thus when randomly reading (freads) a lot of data in 16M chunks - it is page faulting a lot more and causing the performance to drop.
>> I could try to make the application do read instead of freads, but i fear that could be bad too, since it might be hitting the disk with a very small block size and that is not good.
>>
>> With the way i see things now -
>> I believe it could be best if the application does random reads of 4k/1M from the pagepool but somehow does 16M from rotating disks.
>>
>> I don't see any way of doing the above other than following a different approach, where i create a filesystem with a smaller block size (1M or less than 1M), on SSDs as a tier.
>>
>> May i please ask for advice, whether what i am understanding/seeing is right and what the best possible solution is for the above scenario.
>>
>> Regards,
>> Lohit
>>
>> On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru <vall...@cbio.mskcc.org>, wrote:
>>
>> Hey Sven,
>>
>> This is regarding the mmap issues and GPFS.
>> We had discussed previously about experimenting with GPFS 5.
>>
>> I have now upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2.
>>
>> I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap.
>>
>> Have you seen GPFS ever hang on this syscall?
>> [Tue Apr 10 04:20:13 2018] [<ffffffffa0a92155>] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
>>
>> I see the above when the kernel hangs and throws out a series of trace calls.
>>
>> I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however.
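>>
>> In case it helps to isolate this, the pattern the application drives is roughly the following - many threads doing small random reads over mmap'ed files. This is a stripped-down sketch; the paths, file count, read size and thread count below are made up:
>>
>> # rough model of the workload: many threads doing small random reads
>> # over mmap'ed files. all values below are made-up placeholders.
>> import mmap, os, random, threading
>>
>> FILES = ["/gpfs/fs1/data/file%03d" % i for i in range(100)]  # hypothetical paths
>> READ_SIZE = 4096
>> READS_PER_THREAD = 100000
>> NUM_THREADS = 64   # the hangs seem to show up as this grows
>>
>> def reader():
>>     for _ in range(READS_PER_THREAD):
>>         with open(random.choice(FILES), "rb") as f:
>>             size = os.fstat(f.fileno()).st_size
>>             with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
>>                 offset = random.randrange(0, max(1, size - READ_SIZE))
>>                 _ = m[offset:offset + READ_SIZE]   # touch the mapped pages
>>
>> threads = [threading.Thread(target=reader) for _ in range(NUM_THREADS)]
>> for t in threads:
>>     t.start()
>> for t in threads:
>>     t.join()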
>>
>> Also, I think the above happens only when the mmap threads go above a particular number.
>>
>> We had faced a similar issue in 4.2.3, and it was resolved in a patch to 4.2.3.2. At that time, the issue happened when the mmap threads went above worker1threads. According to the ticket - it was an mmap race condition that GPFS was not handling well.
>>
>> I am not sure if this issue is a repeat, and I am yet to isolate the incident and test with an increasing number of mmap threads.
>>
>> I am not 100 percent sure if this is related to mmap yet, but just wanted to ask you if you have seen anything like the above.
>>
>> Thanks,
>>
>> Lohit
>>
>> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme <oeh...@gmail.com>, wrote:
>>
>> Hi Lohit,
>>
>> i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours, see --> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
>> the thread above is silent after a couple of back and forth, but ray and i have active communication in the background and will repost as soon as there is something new to share.
>> i am happy to look at this issue after we finish with ray's workload, but first let's finish his, get you to try the same fix and see if there is still something missing.
>>
>> btw. if people would share their use of MMAP, what applications they use (home grown, just use lmdb which uses mmap under the covers, etc.), please let me know so i get a better picture of how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there that i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually.
>>
>> thx. sven
>>
>> On Thu, Feb 22, 2018 at 12:33 PM <vall...@cbio.mskcc.org> wrote:
>>
>>> Hi all,
>>>
>>> I wanted to know, how does mmap interact with the GPFS pagepool with respect to filesystem block-size?
>>> Does the efficiency depend on the mmap read size and the block-size of the filesystem, even if all the data is cached in the pagepool?
>>>
>>> GPFS 4.2.3.2 and CentOS7.
>>>
>>> Here is what i observed:
>>>
>>> I was testing a user script that uses mmap to read from 100M to 500MB files.
>>>
>>> The above files are stored on 3 different filesystems.
>>>
>>> Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
>>>
>>> 1. 4M block size GPFS filesystem, with separate metadata and data. Data on nearline and metadata on SSDs
>>> 2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and metadata together on SSDs
>>> 3. 16M block size GPFS filesystem, with separate metadata and data. Data on nearline and metadata on SSDs
>>>
>>> When i run the script the first time for “each" filesystem:
>>> I see that GPFS reads from the files, and caches into the pagepool as it reads, from mmdiag --iohist.
>>>
>>> When i run it the second time, i see that there are no IO requests from the compute node to the GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached.
>>>
>>> However - the time taken for the script to run for the files in the 3 different filesystems is different - although i know that they are just "mmapping"/reading from pagepool/cache and not from disk.
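>>>
>>> For reference, the part of the script being timed boils down to something like this. This is a simplified sketch - the real file names, file counts and read sizes differ:
>>>
>>> # map each file once, then time one pass of small random reads that
>>> # should all be served from the pagepool. values below are made up.
>>> import mmap, random, time
>>>
>>> FILES = ["/gpfs/fs1/dataset/sample%02d" % i for i in range(20)]  # ~100M-500MB each
>>> READ_SIZE = 4096
>>> READS = 200000
>>>
>>> maps = []
>>> for path in FILES:
>>>     with open(path, "rb") as f:
>>>         # the mapping stays valid after the file object is closed
>>>         maps.append(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ))
>>>
>>> start = time.time()
>>> for _ in range(READS):
>>>     m = random.choice(maps)
>>>     offset = random.randrange(0, max(1, len(m) - READ_SIZE))
>>>     _ = m[offset:offset + READ_SIZE]
>>> print("one pass took %.1f seconds" % (time.time() - start))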
>>>
>>> Here is the difference in time, for IO just from the pagepool:
>>>
>>> 20s  4M block size
>>> 15s  1M block size
>>> 40s  16M block size
>>>
>>> Why do i see a difference when trying mmap reads from different block-size filesystems, although i see that the IO requests are not hitting the disks and are served just from the pagepool?
>>>
>>> I am willing to share the strace output and mmdiag outputs if needed.
>>>
>>> Thanks,
>>> Lohit
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss