Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-28 Thread Peter Jeremy
On Sat, 2006-Mar-25 21:39:27 +1100, Peter Jeremy wrote:
> What happens if you simulate read-ahead yourself?  Have your main
> program fork and the child access pages slightly ahead of the parent
> but do nothing else.

I suspect something like this may be the best approach for your application.

My suggestion would be to split the backup into 3 processes that share
memory.  I wrote a program designed to buffer data in what looks like a
big FIFO; dump | myfifo | gzip > file.gz is significantly faster than
dump | gzip > file.gz, so I suspect it will help you as well.

Process 1 reads the input file into mmap A.
Process 2 {b,gz}ips mmap A into mmap B.
Process 3 writes mmap B into the output file.

Process 3 and mmap B may be optional, depending on your target's write
performance.

mmap A could be the real file with process 1 just accessing pages to
force them into RAM.

I'd suggest that each mmap be capable of storing several hundred msec of
data as a minimum (maybe 10MB input and 5MB output, preferably more).
Synchronisation can be done by writing tokens into pipes shared with the
mmaps, optimised by sharing read/write pointers (so you only really need
the tokens when the shared buffer is full/empty).
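To make the token scheme concrete, here is a minimal sketch of the
reader/consumer half of such a pipeline (chunk size, slot count and the
single-byte "free slot" tokens are all arbitrary choices; this is an
illustration of the idea, not the buffering program mentioned above):

#include <sys/types.h>
#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK   (1UL << 20)             /* 1MB per token */
#define NCHUNKS 10                      /* ~10MB of shared buffering */

int
main(int argc, char **argv)
{
        char *buf, c = 0;
        ssize_t n;
        int fd, used[2], avail[2], i;

        if (argc != 2)
                errx(1, "usage: %s file", argv[0]);
        if ((fd = open(argv[1], O_RDONLY)) == -1)
                err(1, "open");
        buf = mmap(NULL, CHUNK * NCHUNKS, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_SHARED, -1, 0);
        if (buf == MAP_FAILED)
                err(1, "mmap");
        if (pipe(used) == -1 || pipe(avail) == -1)
                err(1, "pipe");
        for (i = 0; i < NCHUNKS; i++)   /* every slot starts out free */
                write(avail[1], &c, 1);

        switch (fork()) {
        case -1:
                err(1, "fork");
        case 0:                         /* process 1: the reader */
                for (i = 0; ; i = (i + 1) % NCHUNKS) {
                        read(avail[0], &c, 1);          /* wait for a free slot */
                        n = read(fd, buf + i * CHUNK, CHUNK);
                        write(used[1], &n, sizeof(n));  /* token: bytes in slot */
                        if (n <= 0)
                                _exit(0);
                }
        default:                        /* process 2: the consumer/compressor */
                for (i = 0; ; i = (i + 1) % NCHUNKS) {
                        if (read(used[0], &n, sizeof(n)) != sizeof(n) || n <= 0)
                                break;
                        /* compression of buf + i * CHUNK, n bytes, goes here */
                        fwrite(buf + i * CHUNK, 1, (size_t)n, stdout);
                        write(avail[1], &c, 1);         /* slot may be reused */
                }
        }
        return (0);
}

The second pipe is what provides the back-pressure: the reader blocks once
every slot is outstanding, and the consumer hands slots back as it finishes
with them.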

-- 
Peter Jeremy


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-28 Thread Mikhail Teterin
On Tuesday, 28 March 2006 05:27, Peter Jeremy wrote:
> I'd suggest that each mmap be capable of storing several hundred msec of
> data as a minimum (maybe 10MB input and 5MB output, preferably more).
> Synchronisation can be done by writing tokens into pipes shared with the
> mmaps, optimised by sharing read/write pointers (so you only really need
> the tokens when the shared buffer is full/empty).

Thank you very much, Peter, for your suggestions. Unfortunately, I have no
control whatsoever over the dumping part of the process. The dump is done by
Sybase database servers -- old, clunky, and closed-source software, running
on Sun hardware with slow CPUs but good I/O.

You are right, of course, that my application (mzip being only part of it)
needs to keep the dumper and the compressor in sync. Without any cooperation
from the former, however, I see no other way but to temporarily throttle the
NFS bandwidth via the firewall when the compressor falls behind (as can be
detected by the increased proportion of sys-time, I guess).

Much as I appreciate the (past and future) help and suggestions, I'm not asking
you, nor the mailing list, to solve my particular problem here :-) I only gave
the details of my need and application to illustrate a missed general
optimization opportunity in FreeBSD -- reading large files via mmap need not
be slower than via read. If anything, it should be (slightly) faster.

After many days, Matt has finally stated (admitted? ;-):

> read() uses a different heuristic than mmap() to implement the
> read-ahead. There is also code in there which depresses the page
> priority of 'old' already-read pages in the sequential case.

There is no reason not to implement similar smarts in the mmap-handling code
to similarly depress the priority of the in-memory pages in the 
MADV_SEQUENTIAL case, thus freeing more RAM for aggressive read-ahead.
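In the meantime, the effect can be approximated from userland; a rough sketch
of the idea follows (the window size, the page-touch stride and the reliance
on MADV_DONTNEED are assumptions, and this is not mzip's actual code):

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

#define WINDOW  (16UL << 20)            /* process the file 16MB at a time */

static void
scan_sequential(const char *path)
{
        struct stat st;
        char *p;
        off_t off;
        size_t i, len;
        volatile char sink = 0;
        int fd;

        if ((fd = open(path, O_RDONLY)) == -1)
                err(1, "open");
        if (fstat(fd, &st) == -1)
                err(1, "fstat");
        p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                err(1, "mmap");
        /* advise sequential access over the entire region */
        if (madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL) == -1)
                warn("madvise");

        for (off = 0; off < st.st_size; off += (off_t)WINDOW) {
                len = (size_t)(st.st_size - off) < WINDOW ?
                    (size_t)(st.st_size - off) : WINDOW;
                for (i = 0; i < len; i += 4096)
                        sink += p[off + i];     /* stand-in for real processing */
                /* done with this window: let the VM drop it early */
                if (madvise(p + off, len, MADV_DONTNEED) == -1)
                        warn("madvise");
        }
        munmap(p, (size_t)st.st_size);
        close(fd);
}

int
main(int argc, char **argv)
{
        if (argc != 2)
                errx(1, "usage: %s file", argv[0]);
        scan_sequential(argv[1]);
        return (0);
}

Whether MADV_DONTNEED releases the pages quickly enough to leave room for
read-ahead is exactly the kernel-side question raised above; the sketch only
shows where the hints would go.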

As I admitted before, actually implementing this far exceeds my own
capabilities, so all I can do is pester whoever cares to do it instead :-)
C'mon, guys...

-mi


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-26 Thread Mikhail Teterin
On Saturday 25 March 2006 06:46 pm, Peter Jeremy wrote:
= My guess is that the read-ahead algorithms are working but aren't doing
= enough read-ahead to cope with "read a bit, do some cpu-intensive processing,
= repeat" at 25MB/sec, so you're winding up with a degree of serialisation
= where the I/O and compressing aren't overlapped.  I'm not sure how tunable
= the read-ahead is.

Well, is the MADV_SEQUENTIAL advice, given over the entire mmap-ed region,
taken into account anywhere in the kernel? The kernel could read ahead more
aggressively if it freed the just-accessed pages faster than it does in the
default case...

Matt wrote in the same thread:
=It is particularly possible when you combine read() with
=mmap because read() uses a different heuristic then mmap() to
=implement the read-ahead.  There is also code in there which depresses
=the page priority of 'old' already-read pages in the sequential case.

Well, thanks for the theoretical confirmation of what I was trying to prove by
experiments :-) Can this depressing of the old pages in the sequential
case, which read's implementation already has, also be implemented in mmap's
case? It may not *always* be what the mmap-ing program wants, but when the
said program uses MADV_SEQUENTIAL, it should not be ignored... (Bakul
understood this point of mine 3 days ago :-)

Peter Jeremy also wrote, in another message:
= I can't test it as-is because it insists on mmap'ing its output and I only
= have one disk and you can't mmap /dev/null.

If you use a well compressible (redundant) file, such as a web-server log, and 
a high enough compression ratio, you can use the same disk for output -- the 
writes will be very infrequent.

Thanks! Yours,

-mi


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-25 Thread Peter Jeremy
On Fri, 2006-Mar-24 10:00:20 -0800, Matthew Dillon wrote:
> Ok.  The next test is to NOT do umount/remount and then use a data set
> that is ~2x system memory (but can still be mmap'd by grep).  Rerun
> the data set multiple times using grep and grep --mmap.

The results here are weird.  With 1GB RAM and a 2GB dataset, the
timings seem to depend on the sequence of operations: reading is
significantly faster, but only when the data was mmap'd previously.
There's one outlier that I can't easily explain.

hw.physmem: 932249600
hw.usermem: 815050752
+ ls -l /6_i386/var/tmp/test
-rw-r--r--  1 peter  wheel  2052167894 Mar 25 05:44 /6_i386/var/tmp/test
+ /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
+ /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test

This was done in multi-user on a VTY using a script.  X was running
(and I forgot to kill an xclock) but there shouldn't have been anything
else happening.
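The script itself wasn't posted; a reconstruction of the kind of loop that
would produce the labels below (first letter = the run used to prime the
cache, second letter = the timed run) could look like this:

#!/bin/sh
# Reconstruction, not the original script: prime the cache with one grep
# variant, then time the other; repeat as desired for each ordering.
F=/6_i386/var/tmp/test
P=dfhfhdsfhjdsfl
for first in m r; do
	for second in m r; do
		if [ "$first" = m ]; then A=--mmap; else A=; fi
		if [ "$second" = m ]; then B=--mmap; else B=; fi
		grep $A $P $F > /dev/null
		echo -n "$first$second "
		/usr/bin/time grep $B $P $F > /dev/null
	done
done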

grep --mmap followed by grep --mmap:
mm 77.94 real 1.65 user 2.08 sys
mm 78.22 real 1.53 user 2.21 sys
mm 78.34 real 1.55 user 2.21 sys
mm 79.33 real 1.48 user 2.37 sys

grep --mmap followed by grep/read
mr 56.64 real 0.77 user 2.45 sys
mr 56.73 real 0.67 user 2.53 sys
mr 56.86 real 0.68 user 2.60 sys
mr 57.64 real 0.64 user 2.63 sys
mr 57.71 real 0.62 user 2.68 sys
mr 58.04 real 0.63 user 2.59 sys
mr 58.83 real 0.78 user 2.50 sys
mr 59.15 real 0.74 user 2.50 sys

grep/read followed by grep --mmap
rm 75.98 real 1.56 user 2.19 sys
rm 76.06 real 1.50 user 2.29 sys
rm 76.50 real 1.40 user 2.38 sys
rm 77.35 real 1.47 user 2.30 sys
rm 77.49 real 1.39 user 2.44 sys
rm 79.14 real 1.56 user 2.19 sys
rm 88.88 real 1.57 user 2.27 sys

grep/read followed by grep/read
rr 78.00 real 0.69 user 2.74 sys
rr 78.34 real 0.67 user 2.74 sys
rr 79.64 real 0.69 user 2.71 sys
rr 79.69 real 0.73 user 2.75 sys

> ... free and cache pages.  The system would only be allocating ~60MB/s
> (or whatever your disk can do), so the pageout thread ought to be able
> to keep up.

This is a laptop so the disk can only manage a bit over 25 MB/sec.

-- 
Peter Jeremy


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-25 Thread Peter Jeremy
On Fri, 2006-Mar-24 15:18:00 -0500, Mikhail Teterin wrote:
> ... which there is not with the read. Read also requires fairly large buffers in
> the user space to be efficient -- *in addition* to the buffers in the kernel.

I disagree.  With a filesystem read, the kernel is solely responsible
for handling physical I/O with an efficient buffer size.  The userland
buffers simply amortise the cost of the system call and copyout
overheads.

> I'm also quite certain, that fulfilling my demands would add quite a bit of
> complexity to the mmap support in kernel, but hey, that's what the kernel is
> there for :-)

Unfortunately, your patches to implement this seem to have become detached
from your e-mail. :-)

> Unlike grep, which seems to use only 32k buffers anyway (and does not use
> madvise -- see attachment), my program mmaps gigabytes of the input file at
> once, trusting the kernel to do a better job at reading the data in the most
> efficient manner :-)

mmap can lend itself to a cleaner implementation because there's no
need to have a nested loop to read buffers and then process them.  You
can mmap the entire file and process it.  One downside is that on a
32-bit architecture, this limits you to processing files that are
somewhat less than 2GB.  Another downside is that touching an uncached
page triggers a trap which may not be as efficient as reading a block
of data through the filesystem interface, and I/O errors are delivered
via signals (which may not be as easy to handle).

> Peter Jeremy wrote:
> > On an amd64 system running about 6-week old -stable, both ['grep' and 'grep
> > --mmap' -mi] behave pretty much identically.
>
> Peter, I read grep's source -- it is not using madvise (because it hurts
> performance on SunOS-4.1!) and reads in chunks of 32k anyway. Would you care
> to look at my program instead? Thanks:
>
>    http://aldan.algebra.com/mzip.c

fetch: http://aldan.algebra.com/mzip.c: Not Found

I tried writing a program that just mmap'd my entire (2GB) test file
and summed all the longwords in it.  This gave me similar results to
grep.  Setting MADV_SEQUENTIAL and/or MADV_WILLNEED made no noticeable
difference.  I suspect something about your code or system is disabling
the mmap read-ahead functionality.
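That test program wasn't posted either; a minimal equivalent (a
reconstruction, not the original) would be along these lines:

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        struct stat st;
        unsigned long sum = 0;
        size_t i, nwords;
        long *p;
        int fd;

        if (argc != 2)
                errx(1, "usage: %s file", argv[0]);
        if ((fd = open(argv[1], O_RDONLY)) == -1)
                err(1, "open");
        if (fstat(fd, &st) == -1)
                err(1, "fstat");
        p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                err(1, "mmap");
        /* the advice that, per the test above, made no visible difference */
        if (madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL) == -1 ||
            madvise(p, (size_t)st.st_size, MADV_WILLNEED) == -1)
                warn("madvise");
        nwords = (size_t)st.st_size / sizeof(long);     /* ignore any tail bytes */
        for (i = 0; i < nwords; i++)
                sum += (unsigned long)p[i];
        printf("%lu\n", sum);
        return (0);
}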

What happens if you simulate read-ahead yourself?  Have your main
program fork and the child access pages slightly ahead of the parent
but do nothing else.
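A bare-bones version of that experiment might look like the following (the
lead distance and polling interval are arbitrary, and the shared cursor is
assumed to be advanced by the main program as it consumes the mapping):

#include <sys/types.h>
#include <sys/mman.h>
#include <err.h>
#include <unistd.h>

#define LEAD    (8UL << 20)             /* stay ~8MB ahead of the parent */
#define PAGE    4096UL

/* The cursor must live in shared memory so both processes see updates. */
static volatile size_t *
make_shared_cursor(void)
{
        void *p = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_SHARED, -1, 0);

        if (p == MAP_FAILED)
                err(1, "mmap");
        return (p);
}

/* Fork a child that touches pages ahead of *cursor and does nothing else. */
static void
start_prefetcher(const char *base, size_t len, volatile size_t *cursor)
{
        volatile char sink = 0;
        size_t pos = 0, target;

        switch (fork()) {
        case -1:
                err(1, "fork");
        case 0:                                 /* child: the read-ahead helper */
                while (pos < len) {
                        target = *cursor + LEAD;
                        if (target > len)
                                target = len;
                        for (; pos < target; pos += PAGE)
                                sink += base[pos];      /* fault the page in */
                        usleep(1000);                   /* let the parent catch up */
                }
                _exit(0);
        default:                                /* parent continues with real work */
                return;
        }
}

The main program would call make_shared_cursor(), start_prefetcher(), and then
store its current offset into the cursor as it scans the mapping.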

-- 
Peter Jeremy


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-25 Thread Mikhail Teterin
On Saturday 25 March 2006 05:39 am, Peter Jeremy wrote:
= On Fri, 2006-Mar-24 15:18:00 -0500, Mikhail Teterin wrote:
= which there is not with the read. Read also requires fairly large
= buffers in the user space to be efficient -- *in addition* to the
= buffers in the kernel. 
= 
= I disagree.  With a filesystem read, the kernel is solely responsible
= for handling physical I/O with an efficient buffer size. The userland
= buffers simply amortise the cost of the system call and copyout
= overheads.

I don't see a disagreement in the above :-) The mmap API can be slightly faster
than read -- the kernel is still responsible for handling physical I/O with an
efficient buffer size, but instead of copying the data out after reading, it
can read it directly into the process' memory.

= I'm also quite certain, that fulfulling my demands would add quite a
= bit of complexity to the mmap support in kernel, but hey, that's what the
=  kernel is there for :-)
= 
= Unfortunately, your patches to implement this seem to have become detached
= from your e-mail. :-)

If I manage to *convince* someone, that there is a problem to solve, I'll 
consider it a good contribution to the project...

= mmap can lend itself to cleaner implementatione because there's no
= need to have a nested loop to read buffers and then process them.  You
= can mmap then entire file and process it.  The downside is that on a
= 32-bit architecture, this limits you to processing files that are
= somewhat less than 2GB.

First, only one of our architectures is 32-bit :-) On 64-bit systems, the
addressable memory (kind of) matches the maximum file size. Second, even with
a loop reading/processing chunks at a time, the implementation is cleaner,
because it does not need to allocate any memory, nor try to guess which
buffer size to pick for optimal performance, nor align the buffers on pages
(which grep does, for example, rather hairily).

= The downside is that touching an uncached page triggers a trap which may
= not be as efficient as reading a block of data through the filesystem
= interface, and I/O errors are delivered via signals (which may not be as
= easy to handle).

My point exactly. It does seem to be less efficient *at the moment* and I
am trying to have the kernel support for this cleaner method of reading 
*improved*. By convincing someone with a clue to do it, that is... :-)

= Would you care to look at my program instead? Thanks:
= 
=  http://aldan.algebra.com/mzip.c

I'm sorry, that should be  http://aldan.algebra.com/~mi/mzip.c -- I checked 
this time :-(

= I tried writing a program that just mmap'd my entire (2GB) test file
= and summed all the longwords in it.

The files I'm dealing with are database dumps -- 10-80GB :-) Maybe that's
what triggers some pessimal case?..

Thanks! Yours,

-mi


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-25 Thread Matthew Dillon

:The results here are weird.  With 1GB RAM and a 2GB dataset, the
:timings seem to depend on the sequence of operations: reading is
:significantly faster, but only when the data was mmap'd previously
:There's one outlier that I can't easily explain.
:...
:Peter Jeremy

Really odd.  Note that if your disk can only do 25 MBytes/sec, the
calculation is: 2052167894 / 25MB = ~80 seconds, not ~60 seconds 
as you would expect from your numbers.

So that would imply that the 80 second numbers represent read-ahead,
and the 60 second numbers indicate that some of the data was retained
from a prior run (and not blown out by the sequential reading in the
later run).

This type of situation *IS* possible as a side effect of other
heuristics.  It is particularly possible when you combine read() with
mmap because read() uses a different heuristic than mmap() to
implement the read-ahead.  There is also code in there which depresses
the page priority of 'old' already-read pages in the sequential case.
So, for example, if you do a linear grep of 2GB you might end up with
a cache state that looks like this:

l = low priority page
m = medium priority page
h = high priority page

FILE: [---m]

Then when you rescan using mmap,

FILE: [l--m]
  [--lm]
  [-l-m]
  [l--m]
  [---l---m]
  [--lm]
  [-llHHHmm]
  [lllHHmmm]
  [---H]
  [---mmmHm]

The low priority pages don't bump out the medium priority pages
from the previous scan, so the grep winds up doing read-ahead
until it hits the large swath of pages already cached from the
previous scan, without bumping out those pages.

There is also a heuristic in the system (FreeBSD and DragonFly)
which tries to randomly retain pages.  It clearly isn't working :-)
I need to change it to randomly retain swaths of pages, the
idea being that it should take repeated runs to rebalance the VM cache
rather than allowing a single run to blow it out or allowing a
static set of pages to be retained indefinitely, which is what your
tests seem to show is occurring.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-25 Thread John-Mark Gurney
Mikhail Teterin wrote this message on Sat, Mar 25, 2006 at 09:20 -0500:
 = The downside is that touching an uncached page triggers a trap which may
 = not be as efficient as reading a block of data through the filesystem
 = interface, and I/O errors are delivered via signals (which may not be as
 = easy to handle).
 
 My point exactly. It does seem to be less efficient *at the moment* and I
 am trying to have the kernel support for this cleaner method of reading 
 *improved*. By convincing someone with a clue to do it, that is... :-)

I think the thing is that there isn't an easy way to speed up the
faulting of the page, and that is why you are having such trouble
making people believe that there is a problem...

To convince people that there is a problem, you need to run benchmarks,
and make code modifications to show that yes, something can be done to
improve the performance...

The other useful/interesting number would be to compare system time
between the mmap case and the read case to see how much work the
kernel is doing in each case...
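For example, with the cache cleared between runs (file and pattern as in the
earlier posts), the numbers to compare would come from something like:

umount /home && mount /home
/usr/bin/time -l fgrep meowmeowmeow /home/oh.0.dump
umount /home && mount /home
/usr/bin/time -l fgrep --mmap meowmeowmeow /home/oh.0.dump

The 'sys' figures (and, with -l, the page-fault and block-input counts) then
show directly how much of the elapsed time each variant spends in the kernel.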

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 All that I will do, has been done, All that I have, has not.


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-25 Thread Peter Jeremy
On Sat, 2006-Mar-25 10:29:17 -0800, Matthew Dillon wrote:
> Really odd.  Note that if your disk can only do 25 MBytes/sec, the
> calculation is: 2052167894 / 25MB = ~80 seconds, not ~60 seconds
> as you would expect from your numbers.

systat was reporting 25-26 MB/sec.  dd'ing the underlying partition gives
27MB/sec (with 24 and 28 for adjacent partitions).

> This type of situation *IS* possible as a side effect of other
> heuristics.  It is particularly possible when you combine read() with
> mmap because read() uses a different heuristic than mmap() to
> implement the read-ahead.  There is also code in there which depresses
> the page priority of 'old' already-read pages in the sequential case.
> So, for example, if you do a linear grep of 2GB you might end up with
> a cache state that looks like this:

If I've understood you correctly, this also implies that the timing
depends on the previous two scans, not just the previous scan.  I
didn't test all combinations of this but would have expected to see
two distinct sets of mmap/read timings - one for read/mmap/read and
one for mmap/mmap/read.

> I need to change it to randomly retain swaths of pages, the
> idea being that it should take repeated runs to rebalance the VM cache
> rather than allowing a single run to blow it out or allowing a
> static set of pages to be retained indefinitely, which is what your
> tests seem to show is occurring.

I don't think this sort of test is a clear indication that something is
wrong.  There's only one active process at any time and it's performing
a sequential read of a large dataset.  In this case, evicting already
cached data to read new data is not necessarily productive (a simple-
minded algorithm will be evicting data that is going to be accessed in
the near future).

Based on the timings, mmap/read case manages to retain ~15% of the file
in cache.  Given the amount of RAM available, the theoretical limit is
about 40% so this isn't too bad.  It would be nicer if both read and
mmap managed this gain, irrespective of how the data had been previously
accessed.

-- 
Peter Jeremy


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-25 Thread Peter Jeremy
On Sat, 2006-Mar-25 09:20:13 -0500, Mikhail Teterin wrote:
I'm sorry, that should be  http://aldan.algebra.com/~mi/mzip.c -- I checked 
this time :-(

It doesn't look like it's doing anything especially weird.  As Matt
pointed out, creating files with mmap() is not a good idea because the
syncer can cause massive fragmentation when allocating space.  I can't
test it as-is because it insists on mmap'ing its output and I only
have one disk and you can't mmap /dev/null.

Since your program is already written to mmap the input and output in
pieces, it would be trivial to convert it to use read/write.

> = I tried writing a program that just mmap'd my entire (2GB) test file
> = and summed all the longwords in it.
>
> The files I'm dealing with are database dumps -- 10-80GB :-) Maybe that's
> what triggers some pessimal case?..

I tried generating an 11GB test file and got results consistent with my
previous tests: grep using read or mmap, as well as mmap'ing the entire
file, gives similar times with the disk mostly saturated.

I suggest you try converting mzip.c to use read/write and see if the
problem is still present.
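For reference, the read/write shape of such a loop is roughly the following
(generic zlib usage with arbitrary buffer sizes -- a sketch of the conversion,
not a patch to mzip.c itself):

#include <err.h>
#include <string.h>
#include <unistd.h>
#include <zlib.h>

#define BUFSZ (1 << 20)

static void
zip_fd(int in, int out, int level)
{
        static unsigned char ibuf[BUFSZ], obuf[BUFSZ];
        z_stream zs;
        ssize_t n;
        int flush, ret;

        memset(&zs, 0, sizeof(zs));
        if (deflateInit(&zs, level) != Z_OK)
                errx(1, "deflateInit");
        do {
                if ((n = read(in, ibuf, sizeof(ibuf))) == -1)
                        err(1, "read");
                flush = (n == 0) ? Z_FINISH : Z_NO_FLUSH;
                zs.next_in = ibuf;
                zs.avail_in = (uInt)n;
                do {                            /* drain the compressor */
                        zs.next_out = obuf;
                        zs.avail_out = sizeof(obuf);
                        ret = deflate(&zs, flush);
                        if (ret == Z_STREAM_ERROR)
                                errx(1, "deflate");
                        if (write(out, obuf, sizeof(obuf) - zs.avail_out) == -1)
                                err(1, "write");
                } while (zs.avail_out == 0);
        } while (flush != Z_FINISH);
        deflateEnd(&zs);
}

int
main(void)
{
        zip_fd(STDIN_FILENO, STDOUT_FILENO, 6); /* e.g.: ./a.out < in > out.z */
        return (0);
}

The same structure applies to libbz2; the point is simply that the kernel then
performs all read-ahead through the ordinary read() path.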

-- 
Peter Jeremy


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-25 Thread Peter Jeremy
On Fri, 2006-Mar-24 15:18:00 -0500, Mikhail Teterin wrote:
> On the machine, where both mzip and the disk run at only 50%, the disk is a
> plain SATA drive (mzip's state goes from RUN to vnread and back).
> ...
> [systat -vm snapshot, condensed: load average ~0.5; CPU 3.0% sys, 0.0% intr,
> 45.2% user, 51.9% idle; disk ad4: 56.79 KB/t, 241 tps, 13.38 MB/s, 47% busy;
> amrd0 idle]

OK.  I _can_ see something like this when I try to compress a big file using
either your program or gzip.  In my case, both the disk % busy and system idle
vary widely but there's typically 50-60% disk utilisation and 30-40% CPU idle.
However, systat is reporting 23-25MB/sec (whereas dd peaks at ~30MB/sec) so the
time to gzip the datafile isn't that much different to the time to just read it.

My guess is that the read-ahead algorithms are working but aren't doing enough
read-ahead to cope with "read a bit, do some cpu-intensive processing, repeat"
at 25MB/sec, so you're winding up with a degree of serialisation where the I/O
and compressing aren't overlapped.  I'm not sure how tunable the read-ahead is.

-- 
Peter Jeremy


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-24 Thread Peter Jeremy
On Thu, 2006-Mar-23 15:16:11 -0800, Matthew Dillon wrote:
> ... FreeBSD.  To determine which of the two is more likely, you have to
> run a smaller data set (like 600MB of data on a system with 1GB of ram),
> and use the unmount/mount trick to clear the cache before each grep test.

On an amd64 system running about 6-week old -stable, both behave
pretty much identically.  In both cases, systat reports that the disk
is about 96% busy whilst loading the cache.  In the cache case, mmap
is significantly faster.

The test data is 2 copies of OOo_2.0.2rc2_src.tar.gz concatenated.

turion% ls -l /6_i386/var/tmp/test
-rw-r--r--  1 peter  wheel  586333684 Mar 24 19:24 /6_i386/var/tmp/test
turion% /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
   21.69 real 0.16 user 0.68 sys
  1064  maximum resident set size
82  average shared memory size
95  average unshared data size
   138  average unshared stack size
   119  page reclaims
 0  page faults
 0  swaps
  4499  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
  4497  voluntary context switches
  3962  involuntary context switches

[umount/remount /6_i386/var]

turion% /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
   21.68 real 0.41 user 0.51 sys
  1068  maximum resident set size
80  average shared memory size
93  average unshared data size
   136  average unshared stack size
 17836  page reclaims
 18081  page faults
 0  swaps
23  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 18105  voluntary context switches
   169  involuntary context switches

The speed gain with mmap is clearly evident when the data is cached and
the CPU clock wound right down (99MHz instead of 2200MHz):

turion% /usr/bin/time grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
   12.15 real 7.98 user 2.95 sys
turion% /usr/bin/time grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
   12.28 real 7.92 user 2.94 sys
turion% /usr/bin/time grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
   13.16 real 8.03 user 2.89 sys
turion% /usr/bin/time grep dfhfhdsfhjdsfl /6_i386/var/tmp/test 
   17.09 real 6.37 user 8.92 sys
turion% /usr/bin/time grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
   17.36 real 6.35 user 9.37 sys
turion% /usr/bin/time grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
   17.54 real 6.37 user 9.39 sys

-- 
Peter Jeremy


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-24 Thread Matthew Dillon

:On an amd64 system running about 6-week old -stable, both behave
:pretty much identically.  In both cases, systat reports that the disk
:is about 96% busy whilst loading the cache.  In the cache case, mmap
:is significantly faster.
:
:...
:turion% ls -l /6_i386/var/tmp/test
:-rw-r--r--  1 peter  wheel  586333684 Mar 24 19:24 /6_i386/var/tmp/test
:turion% /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
:   21.69 real 0.16 user 0.68 sys
:[umount/remount /6_i386/var]
:turion% /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
:   21.68 real 0.41 user 0.51 sys
:The speed gain with mmap is clearly evident when the data is cached and
:the CPU clock wound right down (99MHz ISO 2200MHz):
:...
:-- 
:Peter Jeremy

That pretty much means that the read-ahead algorithm is working.
If it weren't, the disk would not be running at near 100%.

Ok.  The next test is to NOT do umount/remount and then use a data set
that is ~2x system memory (but can still be mmap'd by grep).  Rerun
the data set multiple times using grep and grep --mmap.

If the times for the mmap case blow up relative to the non-mmap case,
then the vm_page_alloc() calls and/or vm_page_count_severe() (and other
tests) in the vm_fault case are causing the read-ahead to drop out.
If this is the case the problem is not in the read-ahead path, but 
probably in the pageout code not maintaining a sufficient number of
free and cache pages.  The system would only be allocating ~60MB/s
(or whatever your disk can do), so the pageout thread ought to be able
to keep up.

If the times for the mmap case do not blow up, we are back to square
one and I would start investigating the disk driver that Mikhail is
using.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-24 Thread Mikhail Teterin
Matthew Dillon wrote:
   It is possible that the kernel believes the VM system to be too loaded
   to issue read-aheads, as a consequence of your blowing out of the system
   caches.

See attachment for the snapshot of `systat 1 -vm' -- it stays like that for
most of the compression run time, with only occasional flushes to the
amrd0 device (the destination for the compressed output).

Bakul Shah followed up:

 May be the OS needs reclaim-behind for the sequential case?
 This way you can mmap many many pages and use a much smaller
 pool of physical pages to back them.  The idea is for the VM
 to reclaim pages N-k..N-1 when page N is accessed and allow
 the same process to reuse this page.

Although it may be hard for the kernel to guess which pages it can reclaim
efficiently in the general case, my issuing of madvise with MADV_SEQUENTIAL
should've given it a strong hint.

It is for this reason that I very much prefer the mmap API to read/write
(against Matt's repeated advice) -- there is a way to advise the kernel,
which there is not with read. Read also requires fairly large buffers in
user space to be efficient -- *in addition* to the buffers in the kernel.
Managing such buffers properly makes the program far messier _and_ more
OS-dependent than using the mmap interface has to be.

I totally agree with Matt that FreeBSD's (and probably DragonFly's too) mmap
interface is better than others', but, it seems to me, there is plenty of
room for improvement. Reading via mmap should never be slower than via read
-- it should be just a notch faster, in fact...

I'm also quite certain that fulfilling my demands would add quite a bit of
complexity to the mmap support in the kernel, but hey, that's what the kernel
is there for :-)

Unlike grep, which seems to use only 32k buffers anyway (and does not use 
madvise -- see attachment), my program mmaps gigabytes of the input file at 
once, trusting the kernel to do a better job at reading the data in the most 
efficient manner :-)

Peter Jeremy wrote:
 On an amd64 system running about 6-week old -stable, both ['grep' and 'grep 
 --mmap' -mi] behave pretty much identically.

Peter, I read grep's source -- it is not using madvise (because it hurts 
performance on SunOS-4.1!) and reads in chunks of 32k anyway. Would you care 
to look at my program instead? Thanks:

http://aldan.algebra.com/mzip.c

(link with -lz and -lbz2).

Matthew Dillon wrote:
[...]
If the times for the mmap case do not blow up, we are back to square
one and I would start investigating the disk driver that Mikhail is
using.

On the machine, where both mzip and the disk run at only 50%, the disk is a 
plain SATA drive (mzip's state goes from RUN to vnread and back).

Thanks, everyone!

-mi
Index: grep.c
===
RCS file: /home/ncvs/src/gnu/usr.bin/grep/grep.c,v
retrieving revision 1.31.2.1
diff -U2 -r1.31.2.1 grep.c
--- grep.c	26 Oct 2005 21:13:30 -	1.31.2.1
+++ grep.c	24 Mar 2006 19:52:05 -
@@ -427,9 +427,8 @@
 		PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED,
 		bufdesc, bufoffset)
-	  != (caddr_t) -1))
+	  != MAP_FAILED))
 	{
-	  /* Do not bother to use madvise with MADV_SEQUENTIAL or
-	 MADV_WILLNEED on the mmapped memory.  One might think it
-	 would help, but it slows us down about 30% on SunOS 4.1.  */
+	  if (madvise(readbuf, mmapsize, MADV_SEQUENTIAL))
+		warn("madvise");
 	  fillsize = mmapsize;
 	}
@@ -441,4 +440,6 @@
 	 other process has an advisory read lock on the file.
 	 There's no point alarming the user about this misfeature.  */
+	  if (mmapsize)
+		warn("mmap");
 	  bufmapped = 0;
 	  if (bufoffset != initial_bufoffset
[Attachment: systat -vm snapshot taken during the compression run, condensed:
load average ~0.5; CPU 3.0% sys, 0.0% intr, 45.2% user, 51.9% idle;
memory 251432 KB wired, 506156 KB active, 1038216 KB inactive, 89252 KB cache,
2964 KB free; the disk section was cut off in the capture]

Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-24 Thread Bakul Shah
  Maybe the OS needs reclaim-behind for the sequential case?
  This way you can mmap many many pages and use a much smaller
  pool of physical pages to back them.  The idea is for the VM
  to reclaim pages N-k..N-1 when page N is accessed and allow
  the same process to reuse this page.
 
 Although it may be hard for the kernel to guess which pages it can reclaim
 efficiently in the general case, my issuing of madvise with MADV_SEQUENTIAL
 should've given it a strong hint.

Yes, that is what I was saying.  If mmap read can be made as
efficient as the use of read() for this most common case,
there are benefits.  In effect we set up a fifo that rolls
along the mapped address range and the kernel processing and
the user processing are somewhat decoupled.

   Reading via mmap should never be slower, than via read 
 -- it should be just a notch faster, in fact...

Depends on the cost of mostly redundant processing of N
read() syscalls versus the cost of setting up and tearing
down multiple v2p mappings -- presumably page faults
can be avoided if the kernel fills in pages ahead of when
they are first accessed.  The cost of tlbmiss is likely
minor.  Probably the breakeven point is just a few read()
calls.

 I'm also quite certain, that fulfulling my demands would add quite a bit of 
 complexity to the mmap support in kernel, but hey, that's what the kernel is 
 there for :-)

An interesting thought experiment is to assume the system has
*no* read and write calls and see how far you can get with
the present mmap scheme and what extensions are needed to get
back the same functionality.  Yes, assume mmap & friends even
for serial IO!  I am betting that mmap can be simplified.
[Proof by handwaving elided; this screen is too small to fit
my hands :-)]


Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Mikhail Teterin
On Tuesday, 21 March 2006 17:48, Matthew Dillon wrote:
     Reading via mmap() is very well optimized.

Actually, I cannot agree here -- quite the opposite seems true. When running
my compressor locally (no NFS involved) with the `-1' flag (fast, least
effective compression), the program easily compresses faster than it can
read.

The Opteron CPU is about 50% idle, *and so is the disk*, producing only 15MB/s.
I guess, despite the noise I raised on this subject a year ago, reading via
mmap continues to ignore MADV_SEQUENTIAL and has no other adaptability.

Unlike read, which uses buffering, mmap-reading still does not pre-fault the
file's pieces in efficiently :-(

Although the program was written to compress files that are _likely_ still in
memory, when used with regular files it exposes the lack of mmap
optimization.

This should be even more obvious if you time searching for a string in a
large file using grep vs. 'grep --mmap'.
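For instance (any sufficiently large file will do; the path is only a
placeholder):

time grep somestring /path/to/large/file
time grep --mmap somestring /path/to/large/file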

Yours,

-mi

http://aldan.algebra.com/~mi/mzip.c


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Matthew Dillon

:Actually, I can not agree here -- quite the opposite seems true. When running 
:locally (no NFS involved) my compressor with the `-1' flag (fast, least 
:effective compression), the program easily compresses faster, than it can 
:read.
:
:The Opteron CPU is about 50% idle, *and so is the disk* producing only 15Mb/s. 
:I guess, despite the noise I raised on this subject a year ago, reading via 
:mmap continues to ignore the MADV_SEQUENTIONAL and has no other adaptability.
:
:Unlike read, which uses buffering, mmap-reading still does not pre-fault the 
:file's pieces in efficiently :-(
:
:Although the program was written to compress files, that are _likely_ still in 
:memory, when used with regular files, it exposes the lack of mmap 
:optimization.
:
:This should be even more obvious, if you time searching for a string in a 
:large file using grep vs. 'grep --mmap'.
:
:Yours,
:
:   -mi
:
:http://aldan.algebra.com/~mi/mzip.c

Well, I don't know about FreeBSD, but both grep cases work just fine on
DragonFly.  I can't test mzip.c because I don't see the compression
library you are calling (maybe that's a FreeBSD thing).  The results
of the grep test ought to be similar for FreeBSD since the heuristic
used by both OS's is the same.  If they aren't, something might have
gotten nerfed accidentally in the FreeBSD tree.

Here is the cache case test.  mmap is clearly faster (though I would
again caution that this should not be an implicit assumption since
VM fault overheads can rival read() overheads, depending on the
situation).

The 'x1' file in all tests below is simply /usr/share/dict/words
concatenated over and over again to produce a large file.

crater# ls -la x1
-rw-r--r--  1 root  wheel  638228992 Mar 23 11:36 x1
[ machine has 1GB of ram ]

crater# time grep --mmap asdfasf x1
1.000u 0.117s 0:01.11 100.0%10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.976u 0.132s 0:01.13 97.3% 10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.984u 0.140s 0:01.11 100.9%10+41k 0+0io 0pf+0w

crater# time grep asdfasf x1
0.601u 0.781s 0:01.40 98.5% 10+42k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.507u 0.867s 0:01.39 97.8% 10+40k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.562u 0.812s 0:01.43 95.8% 10+41k 0+0io 0pf+0w

crater# iostat 1
[ while grep is running, in order to test the cache case and verify that
  no I/O is occurring once the data has been cached ]


The disk I/O case, which I can test by unmounting and remounting the
partition containing the file in question, then running grep, seems
to be well optimized on DragonFly.  It should be similarly optimized
on FreeBSD since the code that does this optimization is nearly the
same.  In my test, it is clear that the page-fault overhead in the
uncached case is considerably greater than the copying overhead of
a read(), though not by much.  And I would expect that, too.

test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.382u 0.351s 0:10.23 7.1%  55+141k 42+0io 4pf+0w
test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.390u 0.367s 0:10.16 7.3%  48+123k 42+0io 0pf+0w

test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.539u 0.265s 0:10.53 7.5%  36+93k 42+0io 19518pf+0w
test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.617u 0.289s 0:10.47 8.5%  41+105k 42+0io 19518pf+0w
test28# 

test28# iostat 1 during the test showed ~60MBytes/sec for all four tests

Perhaps you should post specifics of the test you are running, as well
as specifics of the results you are getting, such as the actual timing
output instead of a human interpretation of the results.  For that
matter, being an opteron system, were you running the tests on a UP
system or an SMP system?  grep is single-threaded, so on a 2-cpu
system it will show 50% cpu utilization since one cpu will be
saturated and the other idle.  With specifics, a FreeBSD person can
try to reproduce your test results.

A grep vs grep --mmap test is pretty straightforward and should be
a good test of the VM read-ahead code, but there might always be some
unknown circumstance specific to a machine configuration that is
the cause of the problem.  Repeatability and reproducability by
third parties is important when diagnosing any problem.

Insofar as MADV_SEQUENTIAL goes... you shouldn't need it on FreeBSD.
Unless someone ripped it out since I committed it many years ago, which
I doubt, FreeBSD's VM heuristic will figure out that the accesses
are sequential and start issuing read-aheads.  It should pre-fault, and
it should do read-ahead.  That isn't to say that there isn't a bug, just
that everyone interested in the problem has to be able to reproduce it
and help each other track down the source.  Just making 

Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Mikhail Teterin
On Thursday, 23 March 2006 15:48, Matthew Dillon wrote:
     Well, I don't know about FreeBSD, but both grep cases work just fine on
     DragonFly.

Yes, they both do work fine, but time gives very different stats for each. In
my experiments, the total CPU time is noticeably less with mmap, but the
elapsed time is (much) greater. Here are results from FreeBSD-6.1/amd64 --
notice the large number of page faults, because the system does not try to
preload the file in the mmap case as it does in the read case:

time fgrep meowmeowmeow /home/oh.0.dump
2.167u 7.739s 1:25.21 11.6% 70+3701k 23663+0io 6pf+0w
time fgrep --mmap  meowmeowmeow /home/oh.0.dump
1.552u 7.109s 2:46.03 5.2%  18+1031k 156+0io 106327pf+0w

Use a big enough file to bust the memory caching (oh.0.dump above is 2.9GB),
and I'm sure you will have no problems reproducing this result.

     I can't test mzip.c because I don't see the compression 
     library you are calling (maybe that's a FreeBSD thing).

The program uses -lz and -lbz2 -- both are part of FreeBSD since before the
unfortunate fork of DF. The following should work for you:

make -f bsd.prog.mk LDADD="-lz -lbz2" PROG=mzip mzip

Yours,

-mi


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Matthew Dillon

:Yes, they both do work fine, but time gives very different stats for each. In 
:my experiments, the total CPU time is noticably less with mmap, but the 
:elapsed time is (much) greater. Here are results from FreeBSD-6.1/amd64 -- 
:notice the large number of page faults, because the system does not try to 
:preload file in the mmap case as it does in the read case:
:
:   time fgrep meowmeowmeow /home/oh.0.dump
:   2.167u 7.739s 1:25.21 11.6% 70+3701k 23663+0io 6pf+0w
:   time fgrep --mmap  meowmeowmeow /home/oh.0.dump
:   1.552u 7.109s 2:46.03 5.2%  18+1031k 156+0io 106327pf+0w
:
:Use a big enough file to bust the memory caching (oh.0.dump above is 2.9Gb), 
:I'm sure, you will have no problems reproducing this result.

106,000 page faults.  How many pages is a 2.9GB file?  If this is running
in 64-bit mode those would be 8K pages, right?  So that would come to 
around 380,000 pages.  About 1:4.  So, clearly the operating system 
*IS* pre-faulting multiple pages.  

Since I don't believe that a memory fault would be so inefficient as
to account for 80 seconds of run time, it seems more likely to me that
the problem is that the VM system is not issuing read-aheads.  Not
issuing read-aheads would easily account for the 80 seconds.

It is possible that the kernel believes the VM system to be too loaded
to issue read-aheads, as a consequence of your blowing out of the system
caches.  It is also possible that the read-ahead code is broken in
FreeBSD.  To determine which of the two is more likely, you have to
run a smaller data set (like 600MB of data on a system with 1GB of ram),
and use the unmount/mount trick to clear the cache before each grep test.

If the time differential is still huge using the unmount/mount data set
test as described above, then the VM system's read-ahead code is broken.
If the time differential is tiny, however, then it's probably nothing
more than the kernel interpreting your massive 2.9GB mmap as being
too stressful on the VM system and disabling read-aheads for that
reason.

In any case, this sort of test is not really a good poster child for how
to use mmap().  Nobody in their right mind uses mmap() on datasets that
they expect to be uncacheable and which are accessed sequentially.  It's
just plain silly to use mmap() in that sort of circumstance.  This is
a truism on ANY operating system, not just FreeBSD.  The uncached
data set test (using unmount/mount and a dataset which fits into memory)
is a far more realistic test because it simulates the most common case
encountered by a system under load... the accessing of a reasonably sized
data set which happens to not be in the cache.

-Matt



Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Gary Palmer
On Thu, Mar 23, 2006 at 03:16:11PM -0800, Matthew Dillon wrote:
 In anycase, this sort of test is not really a good poster child for how
 to use mmap().  Nobody in their right mind uses mmap() on datasets that
 they expect to be uncacheable and which are accessed sequentially.  It's
 just plain silly to use mmap() in that sort of circumstance.  This is
 a trueism on ANY operating system, not just FreeBSD.  The uncached
 data set test (using unmount/mount and a dataset which fits into memory)
 is a far more realistic test because it simulates the most common case
 encountered by a system under load... the accessing of a reasonably sized
 data set which happens to not be in the cache.

I thought one serious advantage of this situation for sequential-read
mmap() is to madvise(MADV_DONTNEED) so that the pages don't have to
wait for the clock hands to reap them.  On a large Solaris box I used
to have the non-pleasure of running, the VM page scan rate was high, and
I suggested to the app vendor that proper use of mmap might reduce that
overhead.  Admittedly the files in question were much smaller than the
available memory, but they were also not likely to be referenced again
before the memory had to be reclaimed forcibly by the VM system.

Is that not the case?  Is it better to let the VM system reclaim pages
as needed?

Thanks,

Gary


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Bakul Shah
 : time fgrep meowmeowmeow /home/oh.0.dump
 : 2.167u 7.739s 1:25.21 11.6% 70+3701k 23663+0io 6pf+0w
 : time fgrep --mmap  meowmeowmeow /home/oh.0.dump
 : 1.552u 7.109s 2:46.03 5.2%  18+1031k 156+0io 106327pf+0w
 :
 :Use a big enough file to bust the memory caching (oh.0.dump above is 2.9Gb),
 :I'm sure, you will have no problems reproducing this result.
 
 106,000 page faults.  How many pages is a 2.9GB file?  If this is running
 in 64-bit mode those would be 8K pages, right?  So that would come to 
 around 380,000 pages.  About 1:4.  So, clearly the operating system 
 *IS* pre-faulting multiple pages.  
...
 
 In anycase, this sort of test is not really a good poster child for how
 to use mmap().  Nobody in their right mind uses mmap() on datasets that
 they expect to be uncacheable and which are accessed sequentially.  It's
 just plain silly to use mmap() in that sort of circumstance. 

Maybe the OS needs reclaim-behind for the sequential case?
This way you can mmap many many pages and use a much smaller
pool of physical pages to back them.  The idea is for the VM
to reclaim pages N-k..N-1 when page N is accessed and allow
the same process to reuse this page.  Similar to read-ahead,
where the OS schedules reads of pages N+k, N+k+1, ... when page N
is accessed.  Maybe even use TCP-like algorithms to adjust the
backing buffer (window) size :-)


Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

2006-03-23 Thread Matthew Dillon

:I thought one serious advantage to this situation for sequential read
:mmap() is to madvise(MADV_DONTNEED) so that the pages don't have to
:wait for the clock hands to reap them.  On a large Solaris box I used
:to have the non-pleasure of running the VM page scan rate was high, and
:I suggested to the app vendor that proper use of mmap might reduce that
:overhead.  Admitedly the files in question were much smaller than the
:available memory, but they were also not likely to be referenced again
:before the memory had to be reclaimed forcibly by the VM system.
:
:Is that not the case?  Is it better to let the VM system reclaim pages
:as needed?
:
:Thanks,
:
:Gary

madvise() should theoretically have that effect, but it isn't quite
so simple a solution.

Lets say you have, oh, your workstation, with 1GB of ram, and you
run a program which runs several passes on a 900MB data set.
Your X session, xterms, gnome, kde, etc etc etc all take around 300MB
of working memory.

Now that data set could fit into memory if portions of your UI were
pushed out of memory.  The question is not only how much of that data
set should the kernel fit into memory, but which portions of that data
set should the kernel fit into memory and whether the kernel should
bump out other data (pieces of your UI) to make it fit.

Scenario #1:  If the kernel fits the whole 900MB data set into memory,
the entire rest of the system would have to compete for the remaining
100MB of memory.  Your UI would suck rocks.

Scenario #2: If the kernel fits 700MB of the data set into memory, and
the rest of the system (your UI, etc) is only using 300MB, and the kernel
is using MADV_DONTNEED on pages it has already scanned, now your UI
works fine but your data set processing program is continuously 
accessing the disk for all 900MB of data, on every pass, because the
kernel is always only keeping the most recently accessed 700MB of
the 900MB data set in memory.

Scenario #3: Now lets say the kernel decides to keep just the first
700MB of the data set in memory, and not try to cache the last 200MB
of the data set.  Now your UI works fine, and your processing program
runs FOUR TIMES FASTER because it only has to access the disk for
the last 200MB of the 900MB data set.

--

Now, which of these scenarios does madvise() cover?  Does it cover
scenario #1?  Well, no.  The madvise() call that the program makes has
no clue whether you intend to play around with your UI every few minutes,
or whether you intend to leave the room for 40 minutes.  If the kernel
guesses wrong, we wind up with one unhappy user.  

What about scenario #2?  There the program decided to call madvise(),
and the system dutifully reuses the pages, and you come back an hour
later and your data processing program has only done 10 passes out
of the 50 passes it needs to do on the data and you are PISSED.

Ok.  What about scenario #3?  Oops.  The program has no way of knowing
how much memory you need for your UI to be 'happy'.  No madvise() call
of any sort will make you happy.  Not only that, but the KERNEL has no
way of knowing that your data processing program intends to make
multiple passes on the data set, whether the working set is represented
by one file or several files, and even the data processing program itself
might not know (you might be running a script which runs a separate
program for each pass on the same data set).

So much for madvise().

So, no matter what, there will ALWAYS be an unhappy user somewhere.  Lets
take Mikhail's grep test as an example.  If he runs it over and over
again, should the kernel be 'optimized' to realize that the same data
set is being scanned sequentially, over and over again, ignore the
localized sequential nature of the data accesses, and just keep a
dedicated portion of that data set in memory to reduce long term
disk access?  Should it keep the first 1.5GB, or the last 1.5GB,
or perhaps it should slice the data set up and keep every other 256MB
block?  How does it figure out what to cache and when?  What if the
program suddenly starts accessing the data in a cacheable way?

Maybe it should randomly throw some of the data away slowly in the hopes
of 'adapting' to the access pattern, which would also require that it
throw away most of the 'recently read' data far more quickly to make
up for the data it isn't throwing away.  Believe it or not, that
actually works for certain types of problems, except then you get hung
up in a situation where two subsystems are competing with each other
for memory resources (like mail server versus web server), and the
system is unable to cope as the relative load factors for the competing
subsystems change.  The problem becomes really complex really fast.

This