Hello All,

This is a continuation of the previous discussion I had with Sven. However, contrary to what I mentioned previously, I realize that this is "not" related to mmap; I see it when doing random freads.
I see that the block-size of the filesystem matters when reading from the pagepool. There is a major difference in performance between 1M and 16M when doing a lot of random small freads with all of the data in the pagepool: performance with 1M is an order of magnitude "better" than what I see with 16M.

The GPFS version we currently have is:

Version: 5.0.1-0.5
Filesystem version: 19.01 (5.0.1.0)
Block-size: 16M

I had made the filesystem block-size 16M, thinking that I would get the best performance for both random and sequential reads from 16M rather than from smaller block-sizes. With GPFS 5.0, I made use of the 1024 sub-blocks instead of 32, and thus do not lose a lot of storage space even with 16M. I had run a few benchmarks and did see that 16M performed better "when hitting storage/disks" with respect to bandwidth for random/sequential, small/large reads.

However, with this particular workload, which freads chunks of data randomly from hundreds of files, I see that the number of page faults increases with block-size and actually reduces the performance. 1M performs a lot better than 16M, and maybe I will get better performance with less than 1M. It gives the best performance when reading from local disk, with a 4K block-size filesystem.

What I mean by performance for this workload is not the bandwidth, but the amount of time it takes to do each iteration/read batch of data.

What I figure is happening is: fread tries to read a full block size of 16M, which is good in a way when it hits the hard disk, but the application could be using just a small part of that 16M. Thus, when randomly reading (freads) a lot of data in 16M chunks, it page faults a lot more and causes the performance to drop.

I could try to make the application do read instead of fread, but I fear that could be bad too, since it might hit the disk with a very small block size, and that is not good.

The way I see things now, I believe it would be best if the application did random reads of 4K/1M from the pagepool but somehow did 16M reads from the rotating disks. I don't see any way of doing that other than following a different approach, where I create a filesystem with a smaller block size (1M or less), on SSDs, as a tier.

May I please ask for advice on whether what I am understanding/seeing is right, and on the best possible solution for the above scenario?
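To make the access pattern concrete, here is a minimal sketch of the kind of read loop I mean: random small freads spread across many large files, timed as one batch. The file names, file count, file size, and the 4K read size below are placeholders, not the actual application.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Placeholder parameters - the real workload reads variable-sized chunks. */
#define NFILES  100                    /* hundreds of input files          */
#define READSZ  (4 * 1024)             /* small read, far below 16M        */
#define NREADS  10000                  /* reads per timed batch            */
#define FILESZ  (512L * 1024 * 1024)   /* assumed size of each input file  */

int main(void)
{
    FILE *fp[NFILES];
    char buf[READSZ];
    char name[256];

    for (int i = 0; i < NFILES; i++) {
        /* hypothetical path layout */
        snprintf(name, sizeof(name), "data/file%03d.bin", i);
        fp[i] = fopen(name, "rb");
        if (!fp[i]) { perror(name); return 1; }
    }

    srand(1);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int n = 0; n < NREADS; n++) {
        int f = rand() % NFILES;
        /* random 4K-aligned offset within the file */
        long off = (rand() % (FILESZ / READSZ)) * READSZ;
        fseek(fp[f], off, SEEK_SET);
        if (fread(buf, 1, READSZ, fp[f]) != READSZ)
            fprintf(stderr, "short read on file %d\n", f);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d random %dK freads took %.2f s\n", NREADS, READSZ / 1024, secs);

    for (int i = 0; i < NFILES; i++)
        fclose(fp[i]);
    return 0;
}

If we do switch from fread to plain read()/pread(), I assume we would have to do our own buffering/batching so that uncached reads do not hit the disks with very small requests.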
Regards,
Lohit

On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru <vall...@cbio.mskcc.org>, wrote:
> Hey Sven,
>
> This is regarding mmap issues and GPFS.
> We had previously discussed experimenting with GPFS 5.
>
> I have now upgraded all of the compute nodes and NSD nodes to GPFS 5.0.0.2.
>
> I am yet to experiment with mmap performance, but before that, I am seeing weird hangs with GPFS 5 and I think it could be related to mmap.
>
> Have you ever seen GPFS hang on this syscall?
> [Tue Apr 10 04:20:13 2018] [<ffffffffa0a92155>] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26]
>
> I see the above when the kernel hangs and throws out a series of trace calls.
>
> I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS, however.
>
> Also, I think the above happens only when the mmap threads go above a particular number.
>
> We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2. At that time, the issue happened when mmap threads went above worker1threads. According to the ticket, it was a mmap race condition that GPFS was not handling well.
>
> I am not sure if this issue is a repeat, and I am yet to isolate the incident and test with an increasing number of mmap threads.
>
> I am not 100 percent sure if this is related to mmap yet, but I just wanted to ask you if you have seen anything like the above.
>
> Thanks,
>
> Lohit
>
> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme <oeh...@gmail.com>, wrote:
> > Hi Lohit,
> >
> > i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours, see --> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> > the thread above is silent after a couple of back and forth, but ray and i have active communication in the background and will repost as soon as there is something new to share.
> > i am happy to look at this issue after we finish with ray's workload if there is something missing, but first let's finish his, get you to try the same fix and see if there is something missing.
> >
> > btw. if people would share their use of MMAP, what applications they use (home grown, just use lmdb which uses mmap under the cover, etc.) please let me know so i get a better picture on how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually.
> >
> > thx. sven
> >
> >
> > > On Thu, Feb 22, 2018 at 12:33 PM <vall...@cbio.mskcc.org> wrote:
> > > > Hi all,
> > > >
> > > > I wanted to know how mmap interacts with the GPFS pagepool with respect to filesystem block-size.
> > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem, even if all the data is cached in the pagepool?
> > > >
> > > > GPFS 4.2.3.2 and CentOS 7.
> > > >
> > > > Here is what i observed:
> > > >
> > > > I was testing a user script that uses mmap to read from 100MB to 500MB files.
> > > >
> > > > The above files are stored on 3 different filesystems.
> > > >
> > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > > >
> > > > 1. 4M block-size GPFS filesystem, with separate metadata and data. Data on near-line and metadata on SSDs.
> > > > 2. 1M block-size GPFS filesystem as an AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and metadata together on SSDs.
> > > > 3. 16M block-size GPFS filesystem, with separate metadata and data. Data on near-line and metadata on SSDs.
> > > >
> > > > When i run the script the first time for "each" filesystem:
> > > > I see that GPFS reads from the files and caches into the pagepool as it reads, from mmdiag --iohist.
> > > >
> > > > When i run it the second time, i see that there are no IO requests from the compute node to the GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached.
> > > >
> > > > However, the time taken for the script to run for the files in the 3 different filesystems is different, although i know that they are just "mmapping"/reading from pagepool/cache and not from disk.
> > > >
> > > > Here is the difference in time, for IO just from the pagepool:
> > > >
> > > > 20s 4M block size
> > > > 15s 1M block size
> > > > 40s 16M block size.
> > > >
> > > > Why do i see a difference when trying to mmap reads from different block-size filesystems, although i see that the IO requests are not hitting disks and just the pagepool?
> > > >
> > > > I am willing to share the strace output and mmdiag outputs if needed.
> > > >
> > > > Thanks,
> > > > Lohit
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss