Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Sat, 2006-Mar-25 21:39:27 +1100, Peter Jeremy wrote:

> What happens if you simulate read-ahead yourself? Have your main program fork and the child access pages slightly ahead of the parent but do nothing else.

I suspect something like this may be the best approach for your application. My suggestion would be to split the backup into 3 processes that share memory. I wrote a program that is designed to buffer data in what looks like a big FIFO, and "dump | myfifo | gzip > file.gz" is significantly faster than "dump | gzip > file.gz", so I suspect it will help you as well.

Process 1 reads the input file into mmap A. Process 2 {b,g}zips mmap A into mmap B. Process 3 writes mmap B into the output file. Process 3 and mmap B may be optional, depending on your target's write performance. mmap A could be the real file, with process 1 just accessing pages to force them into RAM.

I'd suggest that each mmap be capable of storing several hundred msec of data as a minimum (maybe 10MB input and 5MB output, preferably more). Synchronisation can be done by writing tokens into pipes shared along with the mmaps, optimised by sharing read/write pointers (so you only really need the tokens when the shared buffer is full/empty).

-- Peter Jeremy

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Tuesday, 28 March 2006, at 05:27, Peter Jeremy wrote:

> I'd suggest that each mmap be capable of storing several hundred msec of data as a minimum (maybe 10MB input and 5MB output, preferably more). Synchronisation can be done by writing tokens into pipes shared along with the mmaps, optimised by sharing read/write pointers (so you only really need the tokens when the shared buffer is full/empty).

Thank you very much, Peter, for your suggestions. Unfortunately, I have no control whatsoever over the dump-ing part of the process. The dump is done by Sybase database servers -- old, clunky, and closed-source software, running on slow-CPU (but good-I/O) Sun hardware.

You are right, of course, that my application (mzip being only part of it) needs to keep the dumper and the compressor in sync. Without any cooperation from the former, however, I see no other way but to temporarily throttle the NFS bandwidth via the firewall when the compressor falls behind (as can be detected by the increased proportion of sys-time, I guess).

Much as I appreciate the (past and future) help and suggestions, I'm not asking you, nor the mailing list, to solve my particular problem here :-) I only gave the details of my need and application to illustrate a missed general optimization opportunity in FreeBSD -- reading large files via mmap need not be slower than via read. If anything, it should be (slightly) faster. After many days Matt has finally stated (admitted? ;-):

> read() uses a different heuristic than mmap() to implement the read-ahead. There is also code in there which depresses the page priority of 'old' already-read pages in the sequential case.

There is no reason not to implement similar smarts in the mmap-handling code to similarly depress the priority of the in-memory pages in the MADV_SEQUENTIAL case, thus freeing more RAM for aggressive read-ahead.
As I admitted before, actually implementing this far exceeds my own capabilities, so all I can do is pester, whoever cares, to do it instead :-) C'mon, guys...

-mi
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Saturday 25 March 2006 06:46 pm, Peter Jeremy wrote:

= My guess is that the read-ahead algorithms are working but aren't doing
= enough read-ahead to cope with "read a bit, do some cpu-intensive processing,
= and repeat" at 25MB/sec, so you're winding up with a degree of serialisation
= where the I/O and compressing aren't overlapped. I'm not sure how tunable
= the read-ahead is.

Well, is the MADV_SEQUENTIAL advice, given over the entire mmap-ed region, taken into account anywhere in the kernel? The kernel could read ahead more aggressively if it freed the just-accessed pages faster than it does in the default case...

Matt wrote in the same thread:

= It is particularly possible when you combine read() with
= mmap because read() uses a different heuristic than mmap() to
= implement the read-ahead. There is also code in there which depresses
= the page priority of 'old' already-read pages in the sequential case.

Well, thanks for the theoretical confirmation of what I was trying to prove by experiments :-) Can this depressing of the old pages in the sequential case, which read's implementation already has, also be implemented in mmap's case? It may not *always* be what the mmap-ing program wants, but when the said program uses MADV_SEQUENTIAL, it should not be ignored... (Bakul understood this point of mine 3 days ago :-)

Peter Jeremy also wrote, in another message:

= I can't test it as-is because it insists on mmap'ing its output and I only
= have one disk and you can't mmap /dev/null.

If you use a well-compressible (redundant) file, such as a web-server log, and a high enough compression ratio, you can use the same disk for output -- the writes will be very infrequent.

Thanks! Yours, -mi
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Fri, 2006-Mar-24 10:00:20 -0800, Matthew Dillon wrote:

> Ok. The next test is to NOT do umount/remount and then use a data set that is ~2x system memory (but can still be mmap'd by grep). Rerun the data set multiple times using grep and grep --mmap.

The results here are weird. With 1GB RAM and a 2GB dataset, the timings seem to depend on the sequence of operations: reading is significantly faster, but only when the data was mmap'd previously. There's one outlier that I can't easily explain.

hw.physmem: 932249600
hw.usermem: 815050752

+ ls -l /6_i386/var/tmp/test
-rw-r--r--  1 peter  wheel  2052167894 Mar 25 05:44 /6_i386/var/tmp/test
+ /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
+ /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test

This was done in multi-user on a VTY using a script. X was running (and I forgot to kill an xclock) but there shouldn't have been anything else happening.

grep --mmap followed by grep --mmap:
mm 77.94 real 1.65 user 2.08 sys
mm 78.22 real 1.53 user 2.21 sys
mm 78.34 real 1.55 user 2.21 sys
mm 79.33 real 1.48 user 2.37 sys

grep --mmap followed by grep/read:
mr 56.64 real 0.77 user 2.45 sys
mr 56.73 real 0.67 user 2.53 sys
mr 56.86 real 0.68 user 2.60 sys
mr 57.64 real 0.64 user 2.63 sys
mr 57.71 real 0.62 user 2.68 sys
mr 58.04 real 0.63 user 2.59 sys
mr 58.83 real 0.78 user 2.50 sys
mr 59.15 real 0.74 user 2.50 sys

grep/read followed by grep --mmap:
rm 75.98 real 1.56 user 2.19 sys
rm 76.06 real 1.50 user 2.29 sys
rm 76.50 real 1.40 user 2.38 sys
rm 77.35 real 1.47 user 2.30 sys
rm 77.49 real 1.39 user 2.44 sys
rm 79.14 real 1.56 user 2.19 sys
rm 88.88 real 1.57 user 2.27 sys

grep/read followed by grep/read:
rr 78.00 real 0.69 user 2.74 sys
rr 78.34 real 0.67 user 2.74 sys
rr 79.64 real 0.69 user 2.71 sys
rr 79.69 real 0.73 user 2.75 sys

Matt had also written:

> If this is the case the problem is not in the read-ahead path, but probably in the pageout code not maintaining a sufficient number of free and cache pages. The system would only be allocating ~60MB/s (or whatever your disk can do), so the pageout thread ought to be able to keep up.
This is a laptop so the disk can only manage a bit over 25 MB/sec.

-- Peter Jeremy
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Fri, 2006-Mar-24 15:18:00 -0500, Mikhail Teterin wrote:

> which there is not with the read. Read also requires fairly large buffers in the user space to be efficient -- *in addition* to the buffers in the kernel.

I disagree. With a filesystem read, the kernel is solely responsible for handling physical I/O with an efficient buffer size. The userland buffers simply amortise the cost of the system call and copyout overheads.

> I'm also quite certain, that fulfilling my demands would add quite a bit of complexity to the mmap support in kernel, but hey, that's what the kernel is there for :-)

Unfortunately, your patches to implement this seem to have become detached from your e-mail. :-)

> Unlike grep, which seems to use only 32k buffers anyway (and does not use madvise -- see attachment), my program mmaps gigabytes of the input file at once, trusting the kernel to do a better job at reading the data in the most efficient manner :-)

mmap can lend itself to a cleaner implementation because there's no need to have a nested loop to read buffers and then process them. You can mmap the entire file and process it. One downside is that on a 32-bit architecture, this limits you to processing files that are somewhat less than 2GB. Another downside is that touching an uncached page triggers a trap which may not be as efficient as reading a block of data through the filesystem interface, and I/O errors are delivered via signals (which may not be as easy to handle).

> Peter Jeremy wrote: On an amd64 system running about 6-week old -stable, both ['grep' and 'grep --mmap' -mi] behave pretty much identically.
>
> Peter, I read grep's source -- it is not using madvise (because it hurts performance on SunOS-4.1!) and reads in chunks of 32k anyway. Would you care to look at my program instead? Thanks: http://aldan.algebra.com/mzip.c

fetch: http://aldan.algebra.com/mzip.c: Not Found

I tried writing a program that just mmap'd my entire (2GB) test file and summed all the longwords in it.
This gave me similar results to grep. Setting MADV_SEQUENTIAL and/or MADV_WILLNEED made no noticeable difference. I suspect something about your code or system is disabling the mmap read-ahead functionality.

What happens if you simulate read-ahead yourself? Have your main program fork and the child access pages slightly ahead of the parent but do nothing else.

-- Peter Jeremy
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Saturday 25 March 2006 05:39 am, Peter Jeremy wrote:

= On Fri, 2006-Mar-24 15:18:00 -0500, Mikhail Teterin wrote:
= which there is not with the read. Read also requires fairly large
= buffers in the user space to be efficient -- *in addition* to the
= buffers in the kernel.
=
= I disagree. With a filesystem read, the kernel is solely responsible
= for handling physical I/O with an efficient buffer size. The userland
= buffers simply amortise the cost of the system call and copyout
= overheads.

I don't see a disagreement in the above :-) The mmap API can be slightly faster than read -- the kernel is still responsible for handling physical I/O with an efficient buffer size, but instead of copying the data out after reading, it can read it directly into the process' memory.

= I'm also quite certain, that fulfilling my demands would add quite a
= bit of complexity to the mmap support in kernel, but hey, that's what the
= kernel is there for :-)
=
= Unfortunately, your patches to implement this seem to have become detached
= from your e-mail. :-)

If I manage to *convince* someone that there is a problem to solve, I'll consider it a good contribution to the project...

= mmap can lend itself to a cleaner implementation because there's no
= need to have a nested loop to read buffers and then process them. You
= can mmap the entire file and process it. The downside is that on a
= 32-bit architecture, this limits you to processing files that are
= somewhat less than 2GB.

First, only one of our architectures is 32-bit :-) On 64-bit systems, the addressable memory (kind of) matches the maximum file size. Second, even with the loop reading/processing chunks at a time, the implementation is cleaner, because it does not need to allocate any memory, nor try to guess which buffer size to pick for optimal performance, nor align the buffers on pages (which grep is doing, for example, rather hairily).
= The downside is that touching an uncached page triggers a trap which may
= not be as efficient as reading a block of data through the filesystem
= interface, and I/O errors are delivered via signals (which may not be as
= easy to handle).

My point exactly. It does seem to be less efficient *at the moment*, and I am trying to have the kernel support for this cleaner method of reading *improved*. By convincing someone with a clue to do it, that is... :-)

= Would you care to look at my program instead? Thanks:
=
= http://aldan.algebra.com/mzip.c

I'm sorry, that should be http://aldan.algebra.com/~mi/mzip.c -- I checked this time :-(

= I tried writing a program that just mmap'd my entire (2GB) test file
= and summed all the longwords in it.

The files I'm dealing with are database dumps -- 10-80Gb :-) Maybe that's what triggers some pessimal case?..

Thanks! Yours, -mi
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:The results here are weird. With 1GB RAM and a 2GB dataset, the
:timings seem to depend on the sequence of operations: reading is
:significantly faster, but only when the data was mmap'd previously.
:There's one outlier that I can't easily explain.
:...
:Peter Jeremy

Really odd. Note that if your disk can only do 25 MBytes/sec, the calculation is: 2052167894 / 25MB = ~80 seconds, not ~60 seconds as you would expect from your numbers. So that would imply that the 80-second numbers represent read-ahead, and the 60-second numbers indicate that some of the data was retained from a prior run (and not blown out by the sequential reading in the later run).

This type of situation *IS* possible as a side effect of other heuristics. It is particularly possible when you combine read() with mmap because read() uses a different heuristic than mmap() to implement the read-ahead. There is also code in there which depresses the page priority of 'old' already-read pages in the sequential case. So, for example, if you do a linear grep of 2GB you might end up with a cache state that looks like this (l = low priority page, m = medium priority page, h = high priority page):

[The ASCII-art diagram of successive cache states was mangled in the archive; its point: after the first scan the file's tail is cached at medium priority, and the rescan faults the head back in at low priority.]

The low priority pages don't bump out the medium priority pages from the previous scan, so the grep winds up doing read-ahead until it hits the large swath of pages already cached from the previous scan, without bumping out those pages.

There is also a heuristic in the system (FreeBSD and DragonFly) which tries to randomly retain pages. It clearly isn't working :-) I need to change it to randomly retain swaths of pages, the idea being that it should take repeated runs to rebalance the VM cache rather than allowing a single run to blow it out or allowing a static set of pages to be retained indefinitely, which is what your tests seem to show is occurring.
-Matt
Matthew Dillon [EMAIL PROTECTED]
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
Mikhail Teterin wrote this message on Sat, Mar 25, 2006 at 09:20 -0500:

= The downside is that touching an uncached page triggers a trap which may
= not be as efficient as reading a block of data through the filesystem
= interface, and I/O errors are delivered via signals (which may not be as
= easy to handle).
=
= My point exactly. It does seem to be less efficient *at the moment* and
= I am trying to have the kernel support for this cleaner method of reading
= *improved*. By convincing someone with a clue to do it, that is... :-)

I think the thing is that there isn't an easy way to speed up the faulting of the page, and that is why you are having such trouble making people believe that there is a problem... To convince people that there is a problem, you need to run benchmarks and make code modifications to show that, yes, something can be done to improve the performance...

The other useful/interesting number would be to compare system time between the mmap case and the read case, to see how much work the kernel is doing in each case...

-- John-Mark Gurney  Voice: +1 415 225 5579

All that I will do, has been done, All that I have, has not.
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Sat, 2006-Mar-25 10:29:17 -0800, Matthew Dillon wrote:

> Really odd. Note that if your disk can only do 25 MBytes/sec, the calculation is: 2052167894 / 25MB = ~80 seconds, not ~60 seconds as you would expect from your numbers.

systat was reporting 25-26 MB/sec. dd'ing the underlying partition gives 27MB/sec (with 24 and 28 for adjacent partitions).

> This type of situation *IS* possible as a side effect of other heuristics. It is particularly possible when you combine read() with mmap because read() uses a different heuristic than mmap() to implement the read-ahead. There is also code in there which depresses the page priority of 'old' already-read pages in the sequential case. So, for example, if you do a linear grep of 2GB you might end up with a cache state that looks like this:

If I've understood you correctly, this also implies that the timing depends on the previous two scans, not just the previous scan. I didn't test all combinations of this but would have expected to see two distinct sets of mmap/read timings -- one for read/mmap/read and one for mmap/mmap/read.

> I need to change it to randomly retain swaths of pages, the idea being that it should take repeated runs to rebalance the VM cache rather than allowing a single run to blow it out or allowing a static set of pages to be retained indefinitely, which is what your tests seem to show is occurring.

I don't think this sort of test is a clear indication that something is wrong. There's only one active process at any time and it's performing a sequential read of a large dataset. In this case, evicting already-cached data to read new data is not necessarily productive (a simple-minded algorithm will be evicting data that is going to be accessed in the near future). Based on the timings, the mmap/read case manages to retain ~15% of the file in cache. Given the amount of RAM available, the theoretical limit is about 40%, so this isn't too bad.
It would be nicer if both read and mmap managed this gain, irrespective of how the data had been previously accessed.

-- Peter Jeremy
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Sat, 2006-Mar-25 09:20:13 -0500, Mikhail Teterin wrote:

> I'm sorry, that should be http://aldan.algebra.com/~mi/mzip.c -- I checked this time :-(

It doesn't look like it's doing anything especially weird. As Matt pointed out, creating files with mmap() is not a good idea because the syncer can cause massive fragmentation when allocating space. I can't test it as-is because it insists on mmap'ing its output and I only have one disk and you can't mmap /dev/null. Since your program is already written to mmap the input and output in pieces, it would be trivial to convert it to use read/write.

= I tried writing a program that just mmap'd my entire (2GB) test file
= and summed all the longwords in it.

> The files I'm dealing with are database dumps -- 10-80Gb :-) Maybe that's what triggers some pessimal case?..

I tried generating an 11GB test file and got results consistent with my previous tests: grep using read or mmap, as well as mmap'ing the entire file, give similar times with the disk mostly saturated. I suggest you try converting mzip.c to use read/write and see if the problem is still present.

-- Peter Jeremy
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Fri, 2006-Mar-24 15:18:00 -0500, Mikhail Teterin wrote:

> On the machine, where both mzip and the disk run at only 50%, the disk is a plain SATA drive (mzip's state goes from RUN to vnread and back). ...

[Quoted systat -vm snapshot, flattened in the archive; the recoverable figures: load 0.46 0.53 0.60; CPU 3.0% Sys, 0.0% Intr, 45.2% User, 51.9% Idle; disk ad4 at 56.79 KB/t, 241 tps, 13.38 MB/s, 47% busy; amrd0 idle.]

OK. I _can_ see something like this when I try to compress a big file using either your program or gzip. In my case, both the disk % busy and system idle vary widely, but there's typically 50-60% disk utilisation and 30-40% CPU idle. However, systat is reporting 23-25MB/sec (whereas dd peaks at ~30MB/sec), so the time to gzip the datafile isn't that much different to the time to just read it.

My guess is that the read-ahead algorithms are working but aren't doing enough read-ahead to cope with "read a bit, do some cpu-intensive processing, and repeat" at 25MB/sec, so you're winding up with a degree of serialisation where the I/O and compressing aren't overlapped. I'm not sure how tunable the read-ahead is.

-- Peter Jeremy
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Thu, 2006-Mar-23 15:16:11 -0800, Matthew Dillon wrote:

> FreeBSD. To determine which of the two is more likely, you have to run a smaller data set (like 600MB of data on a system with 1GB of ram), and use the unmount/mount trick to clear the cache before each grep test.

On an amd64 system running about 6-week old -stable, both behave pretty much identically. In both cases, systat reports that the disk is about 96% busy whilst loading the cache. In the cached case, mmap is significantly faster. The test data is 2 copies of OOo_2.0.2rc2_src.tar.gz concatenated.

turion% ls -l /6_i386/var/tmp/test
-rw-r--r--  1 peter  wheel  586333684 Mar 24 19:24 /6_i386/var/tmp/test
turion% /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
       21.69 real         0.16 user         0.68 sys
      1064  maximum resident set size
        82  average shared memory size
        95  average unshared data size
       138  average unshared stack size
       119  page reclaims
         0  page faults
         0  swaps
      4499  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
      4497  voluntary context switches
      3962  involuntary context switches
[umount/remount /6_i386/var]
turion% /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
       21.68 real         0.41 user         0.51 sys
      1068  maximum resident set size
        80  average shared memory size
        93  average unshared data size
       136  average unshared stack size
     17836  page reclaims
     18081  page faults
         0  swaps
        23  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
     18105  voluntary context switches
       169  involuntary context switches

The speed gain with mmap is clearly evident when the data is cached and the CPU clock wound right down (99MHz instead of 2200MHz):

turion% /usr/bin/time grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
       12.15 real         7.98 user         2.95 sys
turion% /usr/bin/time grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
       12.28 real         7.92 user         2.94 sys
turion% /usr/bin/time grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
       13.16 real         8.03 user         2.89 sys
turion% /usr/bin/time grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
       17.09 real         6.37 user         8.92 sys
turion% /usr/bin/time grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
       17.36 real         6.35 user         9.37 sys
turion% /usr/bin/time grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
       17.54 real         6.37 user         9.39 sys

-- Peter Jeremy
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:On an amd64 system running about 6-week old -stable, both behave
:pretty much identically. In both cases, systat reports that the disk
:is about 96% busy whilst loading the cache. In the cached case, mmap
:is significantly faster.
:...
:turion% ls -l /6_i386/var/tmp/test
:-rw-r--r--  1 peter  wheel  586333684 Mar 24 19:24 /6_i386/var/tmp/test
:turion% /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
:       21.69 real         0.16 user         0.68 sys
:[umount/remount /6_i386/var]
:turion% /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
:       21.68 real         0.41 user         0.51 sys
:The speed gain with mmap is clearly evident when the data is cached and
:the CPU clock wound right down (99MHz instead of 2200MHz):
:...
:-- Peter Jeremy

That pretty much means that the read-ahead algorithm is working. If it weren't, the disk would not be running at near 100%.

Ok. The next test is to NOT do umount/remount and then use a data set that is ~2x system memory (but can still be mmap'd by grep). Rerun the data set multiple times using grep and grep --mmap.

If the times for the mmap case blow up relative to the non-mmap case, then the vm_page_alloc() calls and/or vm_page_count_severe() (and other tests) in the vm_fault case are causing the read-ahead to drop out. If this is the case the problem is not in the read-ahead path, but probably in the pageout code not maintaining a sufficient number of free and cache pages. The system would only be allocating ~60MB/s (or whatever your disk can do), so the pageout thread ought to be able to keep up.

If the times for the mmap case do not blow up, we are back to square one and I would start investigating the disk driver that Mikhail is using.

-Matt
Matthew Dillon [EMAIL PROTECTED]
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
Matthew Dillon wrote:

> It is possible that the kernel believes the VM system to be too loaded to issue read-aheads, as a consequence of your blowing out of the system caches.

See attachment for the snapshot of `systat 1 -vm' -- it stays like that for most of the compression run time, with only occasional flushes to the amrd0 device (the destination for the compressed output).

Bakul Shah followed up:

> May be the OS needs reclaim-behind for the sequential case? This way you can mmap many many pages and use a much smaller pool of physical pages to back them. The idea is for the VM to reclaim pages N-k..N-1 when page N is accessed and allow the same process to reuse this page.

Although it may be hard for the kernel to guess which pages it can reclaim efficiently in the general case, my issuing of madvise with MADV_SEQUENTIAL should've given it a strong hint. It is for this reason that I very much prefer the mmap API to read/write (against Matt's repeated advice) -- there is a way to advise the kernel, which there is not with the read. Read also requires fairly large buffers in the user space to be efficient -- *in addition* to the buffers in the kernel. Managing such buffers properly makes the program far messier _and_ more OS-dependent than using the mmap interface has to be.

I totally agree with Matt, that FreeBSD's (and probably DragonFly's too) mmap interface is better than others', but, it seems to me, there is plenty of room for improvement. Reading via mmap should never be slower than via read -- it should be just a notch faster, in fact...
I'm also quite certain that fulfilling my demands would add quite a bit of complexity to the mmap support in the kernel, but hey, that's what the kernel is there for :-)

Unlike grep, which seems to use only 32k buffers anyway (and does not use madvise -- see attachment), my program mmaps gigabytes of the input file at once, trusting the kernel to do a better job at reading the data in the most efficient manner :-)

Peter Jeremy wrote:

> On an amd64 system running about 6-week old -stable, both ['grep' and 'grep --mmap' -mi] behave pretty much identically.

Peter, I read grep's source -- it is not using madvise (because it hurts performance on SunOS-4.1!) and reads in chunks of 32k anyway. Would you care to look at my program instead? Thanks: http://aldan.algebra.com/mzip.c (link with -lz and -lbz2).

Matthew Dillon wrote:

> [...] If the times for the mmap case do not blow up, we are back to square one and I would start investigating the disk driver that Mikhail is using.

On the machine, where both mzip and the disk run at only 50%, the disk is a plain SATA drive (mzip's state goes from RUN to vnread and back).

Thanks, everyone! -mi

Index: grep.c
===
RCS file: /home/ncvs/src/gnu/usr.bin/grep/grep.c,v
retrieving revision 1.31.2.1
diff -U2 -r1.31.2.1 grep.c
--- grep.c	26 Oct 2005 21:13:30 -	1.31.2.1
+++ grep.c	24 Mar 2006 19:52:05 -
@@ -427,9 +427,8 @@
 	       PROT_READ | PROT_WRITE,
 	       MAP_PRIVATE | MAP_FIXED,
 	       bufdesc, bufoffset)
-	      != (caddr_t) -1))
+	      != MAP_FAILED))
 	{
-	  /* Do not bother to use madvise with MADV_SEQUENTIAL or
-	     MADV_WILLNEED on the mmapped memory.  One might think it
-	     would help, but it slows us down about 30% on SunOS 4.1.  */
+	  if (madvise(readbuf, mmapsize, MADV_SEQUENTIAL))
+	    warn("madvise");
 	  fillsize = mmapsize;
 	}
@@ -441,4 +440,6 @@
 	 other process has an advisory read lock on the file.
 	 There's no point alarming the user about this misfeature.  */
+      if (mmapsize)
+	warn("mmap");
       bufmapped = 0;
       if (bufoffset != initial_bufoffset

[Attached systat -vm snapshot of 24 Mar 15:15, flattened in the archive; the recoverable figures: 18 users, load 0.46 0.53 0.60; CPU 3.0% Sys, 0.0% Intr, 45.2% User, 51.9% Idle.]
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
> May be the OS needs reclaim-behind for the sequential case? This way you can mmap many many pages and use a much smaller pool of physical pages to back them. The idea is for the VM to reclaim pages N-k..N-1 when page N is accessed and allow the same process to reuse this page.
>
> Although it may be hard for the kernel to guess which pages it can reclaim efficiently in the general case, my issuing of madvise with MADV_SEQUENTIAL should've given it a strong hint.

Yes, that is what I was saying. If mmap read can be made as efficient as the use of read() for this most common case, there are benefits. In effect we set up a fifo that rolls along the mapped address range, and the kernel processing and the user processing are somewhat decoupled.

> Reading via mmap should never be slower than via read -- it should be just a notch faster, in fact...

Depends on the cost of mostly redundant processing of N read() syscalls versus the cost of setting up and tearing down multiple v2p mappings -- presumably page faults can be avoided if the kernel fills in pages ahead of when they are first accessed. The cost of a tlb miss is likely minor. Probably the breakeven point is just a few read() calls.

> I'm also quite certain, that fulfilling my demands would add quite a bit of complexity to the mmap support in kernel, but hey, that's what the kernel is there for :-)

An interesting thought experiment is to assume the system has *no* read and write calls and see how far you can get with the present mmap scheme and what extensions are needed to get back the same functionality. Yes, assume mmap & friends even for serial IO! I am betting that mmap can be simplified. [Proof by handwaving elided; this screen is too small to fit my hands :-)]
Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Tuesday, 21 March 2006 at 17:48, Matthew Dillon wrote:

Reading via mmap() is very well optimized.

Actually, I cannot agree here -- quite the opposite seems true. When running my compressor locally (no NFS involved) with the `-1' flag (fast, least effective compression), the program easily compresses faster than it can read. The Opteron CPU is about 50% idle, *and so is the disk*, producing only 15MB/s. I guess, despite the noise I raised on this subject a year ago, reading via mmap continues to ignore MADV_SEQUENTIAL and has no other adaptability. Unlike read, which uses buffering, mmap-reading still does not pre-fault the file's pieces in efficiently :-(

Although the program was written to compress files that are _likely_ still in memory, when used with regular files it exposes the lack of mmap optimization. This should be even more obvious if you time searching for a string in a large file using grep vs. 'grep --mmap'.

Yours,

 -mi

http://aldan.algebra.com/~mi/mzip.c
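For contrast, the read() path that the message says is well served by the kernel's buffering and read-ahead is just the classic buffered loop; a minimal sketch (the function is mine, not taken from mzip.c):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Total the size of a file via plain read() -- the code path for which
 * the kernel's sequential read-ahead heuristic is said to work well. */
static long count_bytes_read(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    char buf[64 * 1024];          /* one read-ahead-friendly chunk */
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}
```

The point of the comparison is that this loop copies every byte through a kernel buffer, yet the copying is cheap next to a stalled disk; the mmap path avoids the copy but, per the complaint above, may stall on faults instead.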
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:Actually, I cannot agree here -- quite the opposite seems true. When running
:locally (no NFS involved) my compressor with the `-1' flag (fast, least
:effective compression), the program easily compresses faster than it can
:read.
:
:The Opteron CPU is about 50% idle, *and so is the disk*, producing only 15MB/s.
:I guess, despite the noise I raised on this subject a year ago, reading via
:mmap continues to ignore MADV_SEQUENTIAL and has no other adaptability.
:
:Unlike read, which uses buffering, mmap-reading still does not pre-fault the
:file's pieces in efficiently :-(
:
:Although the program was written to compress files that are _likely_ still in
:memory, when used with regular files it exposes the lack of mmap
:optimization.
:
:This should be even more obvious if you time searching for a string in a
:large file using grep vs. 'grep --mmap'.
:
:Yours,
:
: -mi
:
:http://aldan.algebra.com/~mi/mzip.c

Well, I don't know about FreeBSD, but both grep cases work just fine on DragonFly. I can't test mzip.c because I don't see the compression library you are calling (maybe that's a FreeBSD thing). The results of the grep test ought to be similar for FreeBSD, since the heuristic used by both OSs is the same. If they aren't, something might have gotten nerfed accidentally in the FreeBSD tree.

Here is the cached-case test. mmap is clearly faster (though I would again caution that this should not be an implicit assumption, since VM fault overheads can rival read() overheads, depending on the situation). The 'x1' file in all tests below is simply /usr/share/dict/words concatenated over and over again to produce a large file.
crater# ls -la x1
-rw-r--r--  1 root  wheel  638228992 Mar 23 11:36 x1

[ machine has 1GB of RAM ]

crater# time grep --mmap asdfasf x1
1.000u 0.117s 0:01.11 100.0%  10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.976u 0.132s 0:01.13 97.3%  10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.984u 0.140s 0:01.11 100.9%  10+41k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.601u 0.781s 0:01.40 98.5%  10+42k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.507u 0.867s 0:01.39 97.8%  10+40k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.562u 0.812s 0:01.43 95.8%  10+41k 0+0io 0pf+0w
crater# iostat 1
[ while grep is running, in order to test the cached case and verify
  that no I/O is occurring once the data has been cached ]

The disk I/O case, which I can test by unmounting and remounting the partition containing the file in question and then running grep, seems to be well optimized on DragonFly. It should be similarly optimized on FreeBSD, since the code that does this optimization is nearly the same. In my test it is clear that the page-fault overhead in the uncached case is greater than the copying overhead of a read(), though not by much. And I would expect that, too.

test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.382u 0.351s 0:10.23 7.1%  55+141k 42+0io 4pf+0w
test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.390u 0.367s 0:10.16 7.3%  48+123k 42+0io 0pf+0w
test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.539u 0.265s 0:10.53 7.5%  36+93k 42+0io 19518pf+0w
test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.617u 0.289s 0:10.47 8.5%  41+105k 42+0io 19518pf+0w
test28#

iostat 1 during the test showed ~60MBytes/sec for all four tests.

Perhaps you should post the specifics of the test you are running, as well as the specifics of the results you are getting, such as the actual timing output, instead of a human interpretation of the results.
For that matter, being an Opteron system, were you running the tests on a UP or an SMP system? grep is single-threaded, so on a 2-CPU system it will show 50% CPU utilization: one CPU saturated and the other idle. With specifics, a FreeBSD person can try to reproduce your test results. A grep vs. grep --mmap test is pretty straightforward and should be a good test of the VM read-ahead code, but there might always be some unknown circumstance specific to a machine configuration that is the cause of the problem. Repeatability and reproducibility by third parties are important when diagnosing any problem.

Insofar as MADV_SEQUENTIAL goes... you shouldn't need it on FreeBSD. Unless someone ripped it out since I committed it many years ago, which I doubt, FreeBSD's VM heuristic will figure out that the accesses are sequential and start issuing read-aheads. It should pre-fault, and it should do read-ahead. That isn't to say that there isn't a bug, just that everyone interested in the problem has to be able to reproduce it and help each other track down the source. Just making
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Thursday, 23 March 2006 at 15:48, Matthew Dillon wrote:

Well, I don't know about FreeBSD, but both grep cases work just fine on DragonFly.

Yes, they both work fine, but time gives very different stats for each. In my experiments the total CPU time is noticeably less with mmap, but the elapsed time is (much) greater. Here are results from FreeBSD-6.1/amd64 -- notice the large number of page faults, because the system does not try to preload the file in the mmap case as it does in the read case:

	time fgrep meowmeowmeow /home/oh.0.dump
	2.167u 7.739s 1:25.21 11.6% 70+3701k 23663+0io 6pf+0w
	time fgrep --mmap meowmeowmeow /home/oh.0.dump
	1.552u 7.109s 2:46.03 5.2% 18+1031k 156+0io 106327pf+0w

Use a big enough file to bust the memory caching (oh.0.dump above is 2.9GB); I'm sure you will have no problems reproducing this result.

I can't test mzip.c because I don't see the compression library you are calling (maybe that's a FreeBSD thing).

The program uses -lz and -lbz2 -- both have been part of FreeBSD since before the unfortunate fork of DF. The following should work for you:

	make -f bsd.prog.mk LDADD="-lz -lbz2" PROG=mzip mzip

Yours,

	-mi
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:Yes, they both work fine, but time gives very different stats for each. In
:my experiments the total CPU time is noticeably less with mmap, but the
:elapsed time is (much) greater. Here are results from FreeBSD-6.1/amd64 --
:notice the large number of page faults, because the system does not try to
:preload the file in the mmap case as it does in the read case:
:
:	time fgrep meowmeowmeow /home/oh.0.dump
:	2.167u 7.739s 1:25.21 11.6% 70+3701k 23663+0io 6pf+0w
:	time fgrep --mmap meowmeowmeow /home/oh.0.dump
:	1.552u 7.109s 2:46.03 5.2% 18+1031k 156+0io 106327pf+0w
:
:Use a big enough file to bust the memory caching (oh.0.dump above is 2.9GB);
:I'm sure you will have no problems reproducing this result.

106,000 page faults. How many pages is a 2.9GB file? If this is running in 64-bit mode those would be 8K pages, right? So that would come to around 380,000 pages. About 1:4. So, clearly the operating system *IS* pre-faulting multiple pages.

Since I don't believe that a memory fault would be so inefficient as to account for 80 seconds of run time, it seems more likely to me that the problem is that the VM system is not issuing read-aheads. Not issuing read-aheads would easily account for the 80 seconds.

It is possible that the kernel believes the VM system to be too loaded to issue read-aheads, as a consequence of your blowing out of the system caches. It is also possible that the read-ahead code is broken in FreeBSD. To determine which of the two is more likely, you have to run a smaller data set (like 600MB of data on a system with 1GB of RAM), and use the unmount/mount trick to clear the cache before each grep test. If the time differential is still huge using the unmount/mount data-set test as described above, then the VM system's read-ahead code is broken.
If the time differential is tiny, however, then it's probably nothing more than the kernel interpreting your massive 2.9GB mmap as being too stressful on the VM system and disabling read-aheads for that reason.

In any case, this sort of test is not really a good poster child for how to use mmap(). Nobody in their right mind uses mmap() on datasets that they expect to be uncacheable and which are accessed sequentially. It's just plain silly to use mmap() in that sort of circumstance. This is a truism on ANY operating system, not just FreeBSD.

The uncached data-set test (using unmount/mount and a data set which fits into memory) is a far more realistic test, because it simulates the most common case encountered by a system under load... the accessing of a reasonably sized data set which happens not to be in the cache.

-Matt
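Matt's fault arithmetic (2.9GB file, 8K pages, 106,327 faults) can be checked mechanically; a small sketch, with the helper name mine:

```c
#include <assert.h>

/* Pages needed to map `bytes` bytes at the given page size, rounded up. */
static long long page_count(long long bytes, long long pagesize)
{
    return (bytes + pagesize - 1) / pagesize;
}
```

With bytes = 2.9 * 2^30 (about 3,113,851,289) and an 8K page, page_count() comes to roughly 380,000 pages; divided by the 106,327 observed faults, that is about 3.6 pages per fault -- Matt's "about 1:4", i.e. pre-faulting is happening even if read-ahead is not.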
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
On Thu, Mar 23, 2006 at 03:16:11PM -0800, Matthew Dillon wrote:

In any case, this sort of test is not really a good poster child for how to use mmap(). Nobody in their right mind uses mmap() on datasets that they expect to be uncacheable and which are accessed sequentially. It's just plain silly to use mmap() in that sort of circumstance. This is a truism on ANY operating system, not just FreeBSD. The uncached data-set test (using unmount/mount and a data set which fits into memory) is a far more realistic test, because it simulates the most common case encountered by a system under load... the accessing of a reasonably sized data set which happens not to be in the cache.

I thought one serious advantage of mmap() in this sequential-read situation is that you can madvise(MADV_DONTNEED) so that the pages don't have to wait for the clock hands to reap them. On a large Solaris box I used to have the non-pleasure of running, the VM page-scan rate was high, and I suggested to the app vendor that proper use of mmap might reduce that overhead. Admittedly the files in question were much smaller than the available memory, but they were also not likely to be referenced again before the memory had to be forcibly reclaimed by the VM system.

Is that not the case? Is it better to let the VM system reclaim pages as needed?

Thanks,
Gary
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:	time fgrep meowmeowmeow /home/oh.0.dump
:	2.167u 7.739s 1:25.21 11.6% 70+3701k 23663+0io 6pf+0w
:	time fgrep --mmap meowmeowmeow /home/oh.0.dump
:	1.552u 7.109s 2:46.03 5.2% 18+1031k 156+0io 106327pf+0w
:
:Use a big enough file to bust the memory caching (oh.0.dump above is 2.9GB);
:I'm sure you will have no problems reproducing this result.

106,000 page faults. How many pages is a 2.9GB file? If this is running in 64-bit mode those would be 8K pages, right? So that would come to around 380,000 pages. About 1:4. So, clearly the operating system *IS* pre-faulting multiple pages.

...

In any case, this sort of test is not really a good poster child for how to use mmap(). Nobody in their right mind uses mmap() on datasets that they expect to be uncacheable and which are accessed sequentially. It's just plain silly to use mmap() in that sort of circumstance.

Maybe the OS needs reclaim-behind for the sequential case? This way you can mmap many, many pages and use a much smaller pool of physical pages to back them. The idea is for the VM to reclaim pages N-k..N-1 when page N is accessed and allow the same process to reuse those pages -- the mirror image of read-ahead, where the OS schedules reads of pages N+k, N+k+1, ... when page N is accessed. Maybe even use TCP-like algorithms to adjust the backing buffer (window) size :-)
Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
:I thought one serious advantage to this situation for sequential read
:mmap() is to madvise(MADV_DONTNEED) so that the pages don't have to
:wait for the clock hands to reap them. On a large Solaris box I used
:to have the non-pleasure of running, the VM page-scan rate was high, and
:I suggested to the app vendor that proper use of mmap might reduce that
:overhead. Admittedly the files in question were much smaller than the
:available memory, but they were also not likely to be referenced again
:before the memory had to be forcibly reclaimed by the VM system.
:
:Is that not the case? Is it better to let the VM system reclaim pages
:as needed?
:
:Thanks,
:
:Gary

madvise() should theoretically have that effect, but it isn't quite so simple a solution. Let's say you have, oh, your workstation, with 1GB of RAM, and you run a program which makes several passes over a 900MB data set. Your X session, xterms, gnome, kde, etc. all take around 300MB of working memory. Now, that data set could fit into memory if portions of your UI were pushed out of it. The question is not only how much of that data set the kernel should fit into memory, but which portions of it, and whether the kernel should bump out other data (pieces of your UI) to make it fit.

Scenario #1: If the kernel fits the whole 900MB data set into memory, the entire rest of the system has to compete for the remaining 100MB of memory. Your UI would suck rocks.

Scenario #2: If the kernel fits 700MB of the data set into memory, the rest of the system (your UI, etc.) is only using 300MB, and the kernel is using MADV_DONTNEED on pages it has already scanned, then your UI works fine, but your data-set-processing program is continuously accessing the disk for all 900MB of data on every pass, because the kernel is always keeping only the most recently accessed 700MB of the 900MB data set in memory.
Scenario #3: Now let's say the kernel decides to keep just the first 700MB of the data set in memory, and not try to cache the last 200MB. Now your UI works fine, and your processing program runs FOUR TIMES FASTER, because it only has to access the disk for the last 200MB of the 900MB data set.

--

Now, which of these scenarios does madvise() cover? Does it cover scenario #1? Well, no. The madvise() call that the program makes has no clue whether you intend to play around with your UI every few minutes or leave the room for 40 minutes. If the kernel guesses wrong, we wind up with one unhappy user.

What about scenario #2? There the program decided to call madvise(), the system dutifully reuses the pages, and you come back an hour later to find that your data-processing program has only done 10 of the 50 passes it needs to make over the data, and you are PISSED.

OK. What about scenario #3? Oops. The program has no way of knowing how much memory you need for your UI to be 'happy'. No madvise() call of any sort will make you happy. Not only that, but the KERNEL has no way of knowing that your data-processing program intends to make multiple passes over the data set, whether the working set is represented by one file or several files, and even the data-processing program itself might not know (you might be running a script which runs a separate program for each pass over the same data set). So much for madvise(). So, no matter what, there will ALWAYS be an unhappy user somewhere.

Let's take Mikhail's grep test as an example. If he runs it over and over again, should the kernel be 'optimized' to realize that the same data set is being scanned sequentially, over and over again, ignore the localized sequential nature of the accesses, and just keep a dedicated portion of that data set in memory to reduce long-term disk access?
Should it keep the first 1.5GB, or the last 1.5GB, or perhaps slice the data set up and keep every other 256MB block? How does it figure out what to cache and when? What if the program suddenly starts accessing the data in a cacheable way? Maybe it should slowly throw some of the data away at random in the hope of 'adapting' to the access pattern, which would also require throwing away most of the 'recently read' data far more quickly, to make up for the data it isn't throwing away. Believe it or not, that actually works for certain types of problems, except then you get hung up in a situation where two subsystems compete with each other for memory resources (like a mail server versus a web server), and the system is unable to cope as the relative load factors of the competing subsystems change. The problem becomes really complex really fast. This