Re: Performance of madvise / msync

2008-06-27 Thread Matthew Dillon
:With madvise() and without msync(), there are high numbers of
:faults, which matches the number of disk io operations.  It
:goes through cycles, every once in a while stalling while about
:60MB of data is dumped to disk at 20MB/s or so (buffers flushing?)
:At the beginning of each cycle it's fast, with 140 faults/s or so,
:and slows as the number of faults climbs to 180/s or so before
:stalling and flusing again.  It never gets _really_ slow though.

Yah, without the msync() the dirty pages build up in the kernel's
VM page cache.  A flush should happen automatically every 30-60
seconds, or sooner if the buffer cache builds up too many dirty pages.

The activity you are seeing sounds like the 30-60 second filesystem
sync the kernel does periodically.

Either NetBSD or OpenBSD, I forget which, implemented a partial sync
feature to prevent long stalls when the filesystem syncer hits a file
with a lot of dirty pages.  FreeBSD could borrow that optimization if
they want to reduce stalls from the filesytem sync.  I ported it to DFly
a while back myself.

:With msync() and without madvise(), things are very slow, and
:there are no faults, just writes.
:...
:  The size_t argument to msync() (0x453b7618) is highly questionable.
:  It could be ktrace reporting the wrong value, but maybe not.
:
:That's the size of rg2.rrd.  It's 1161524760 bytes long.
:...
:Looks like the source of my problem is very slow msync() on the
:file when the file is over a certain size.  It's still fastest
:without either madvise or msync.
:
:Thanks for your time,
:
:Marcus

The msync() is clearly the problem.  There are numerous optimizations
in the kernel but msync() is frankly a rather nasty critter even with
the optimizations work.  Nobody using msync() in real life ever tries
to run it over the entirety of such a large mapping... usually it is
just run on explicit sub-ranges that the program wishes to sync.

One reason why msync() is so nasty is that the kernel must physically
check the page table(s) to determine whether a page has been marked dirty
by the MMU, so it can't just iterate the pages it knows are dirty in
the VM object.  It's nasty whether it scans the VM object and iterates
the page tables, or scans the page tables and looks up the related VM
pages.   The only way to optimize this is to force write-faults by
mapping clean pages read-only, in order to track whether a page is
actually dirty in real time instead of lazily.  Then msync() would
only have to do a ranged-scan of the VM object's dirty-page list
and would not have to actually check the page tables for clean pages.

A secondary effect of the msync() is that it is initiating asynchronous
I/O for what sounds like hundreds of VM pages, or even more.  All those
pages are locked and busied from the point they are queued to the point
the I/O finishes, which for some of the pages can be a very, very long
time (into the multiples of seconds).  Pages locked that long will
interfere with madvise() calls made after the msync(), and probably
even interfere with the follow msync().

It used to be that msync() only synced VM pages to the underlying
file, making them consistent with read()'s and write()'s against
the underlying file.  Since FreeBSD uses a unified VM page cache
this is always true.  However, the Open Group specification now
requires that the dirty pages actually be written out to the underlying
media... i.e. issue real I/O.  So msync() can't be a NOP if you go by
the OpenGroup specification.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Performance of madvise / msync

2008-06-26 Thread Marcus Reid
Hi,

I'm using py-rrdtool 0.2.1 with rrdtool 1.3.0 under 7.0-STABLE, and
there's a couple of things about this new version of rrdtool that
hurt performance under FreeBSD, but apparently help on whatever they
tested on.

For every update, the database file is opened, mapped into memory,
madvise() is called, contents are modified, msync() is called, and
the file is unmapped and closed:

   65074 python   0.09 CALL  open(0x831a1b4,O_RDWR,unused0)
   65074 python   0.13 NAMI  rg2.rrd
   65074 python   0.24 RET   open 7
   65074 python   0.07 CALL  fstat(0x7,0xbfbfe428)
   65074 python   0.11 RET   fstat 0
   65074 python   0.08 CALL  
mmap(0,0x453b7618,PROT_READ|PROT_WRITE,MAP_SHARED,0x7,0,0)
   65074 python   0.18 RET   mmap 679235584/0x287c5000
   65074 python   0.07 CALL  madvise(0x287c5000,0x453b7618,_MADV_RANDOM)
   65074 python   0.08 RET   madvise 0
   65074 python   0.06 CALL  madvise(0x287c5000,0x70,_MADV_WILLNEED)
   65074 python   0.027455 RET   madvise 0
   65074 python   0.58 CALL  madvise(0x287c5000,0x1c20,_MADV_WILLNEED)
   65074 python   0.016904 RET   madvise 0
   65074 python   0.000179 CALL  madvise(0x287c6000,0x1950,_MADV_WILLNEED)
   65074 python   0.008629 RET   madvise 0
   65074 python   0.40 CALL  madvise(0x287c8000,0x8,_MADV_WILLNEED)
   65074 python   0.004173 RET   madvise 0
   65074 python   0.48 CALL  gettimeofday(0xbfbfe554,0)
   65074 python   0.09 RET   gettimeofday 0
   65074 python   0.12 CALL  fcntl(0x7,invalid=12,0xbfbfe478)
   65074 python   0.24 RET   fcntl 0
   65074 python   0.006084 CALL  msync(0x287c5000,0x453b7618,MS_ASYNC)
   65074 python   0.106284 RET   msync 0
   65074 python   0.000483 CALL  munmap(0x287c5000,0x453b7618)
   65074 python   0.000356 RET   munmap 0
   65074 python   0.12 CALL  close(0x7)
   65074 python   0.46 RET   close 0
   65074 python   0.000173 CALL  
__sysctl(0xbfbfe388,0x2,0xbfbfe394,0xbfbfe398,0,0)
   65074 python   0.16 RET   __sysctl 0

Here's a similar update without the calls to madvise and msync:

   96372 python   0.11 CALL  open(0x831aa34,O_RDWR,unused0)
   96372 python   0.16 NAMI  rg2.rrd
   96372 python   0.25 RET   open 7
   96372 python   0.09 CALL  fstat(0x7,0xbfbfe428)
   96372 python   0.14 RET   fstat 0
   96372 python   0.10 CALL  
mmap(0,0x453b7618,PROT_READ|PROT_WRITE,MAP_SHARED,0x7,0,0)
   96372 python   0.21 RET   mmap 679235584/0x287c5000
   96372 python   0.000101 CALL  madvise(0x287c5000,0x1c20,_MADV_WILLNEED)
   96372 python   0.13 RET   madvise 0
   96372 python   0.10 CALL  madvise(0x287c6000,0x1950,_MADV_WILLNEED)
   96372 python   0.10 RET   madvise 0
   96372 python   0.09 CALL  madvise(0x287c8000,0x8,_MADV_WILLNEED)
   96372 python   0.09 RET   madvise 0
   96372 python   0.10 CALL  gettimeofday(0xbfbfe554,0)
   96372 python   0.09 RET   gettimeofday 0
   96372 python   0.11 CALL  fcntl(0x7,invalid=12,0xbfbfe478)
   96372 python   0.16 RET   fcntl 0
   96372 python   0.002024 CALL  munmap(0x287c5000,0x453b7618)
   96372 python   0.000366 RET   munmap 0
   96372 python   0.16 CALL  close(0x7)
   96372 python   0.46 RET   close 0
   96372 python   0.000108 CALL  
__sysctl(0xbfbfe388,0x2,0xbfbfe394,0xbfbfe398,0,0)
   96372 python   0.17 RET   __sysctl 0

As you can see, it's quite a bit faster.

I know that msync is necessary under Linux but obsolete under FreeBSD, but
it's still funny that it takes a tenth of a second to return even with
MS_ASYNC specified.

Also, why is it that the madvise() calls take so much longer when the
program does a couple of its own madvise() calls?  Was madvise() never
intended to be run so frequently and is therefore a little slower than
it could be?

Here's the diff between the code for the first kdump above and the
second one.

*** src/rrd_open.c.orig Tue Jun 10 23:12:55 2008
--- src/rrd_open.c  Wed Jun 25 21:43:54 2008
***
*** 175,191 
  #endif
  if (rdwr  RRD_CREAT)
  goto out_done;
- #ifdef USE_MADVISE
- if (rdwr  RRD_COPY) {
- /* We will read everything in a moment (copying) */
- madvise(data, rrd_file-file_len, MADV_WILLNEED | MADV_SEQUENTIAL);
- } else {
- /* We do not need to read anything in for the moment */
- madvise(data, rrd_file-file_len, MADV_RANDOM);
- /* the stat_head will be needed soonish, so hint accordingly */
- madvise(data, sizeof(stat_head_t), MADV_WILLNEED | MADV_RANDOM);
- }
- #endif

  __rrd_read(rrd-stat_head, stat_head_t,
 1);
--- 175,180 
***
*** 388,396 
  int   ret;

  #ifdef HAVE_MMAP
- ret = msync(rrd_file-file_start, rrd_file-file_len, MS_ASYNC);
- if (ret != 0)
- rrd_set_error(msync rrd_file: %s, rrd_strerror(errno));
  ret = munmap(rrd_file-file_start, rrd_file-file_len);
  if (ret != 0)
  rrd_set_error(munmap rrd_file: %s, 

Re: Performance of madvise / msync

2008-06-26 Thread Matthew Dillon
:   65074 python   0.06 CALL  madvise(0x287c5000,0x70,_MADV_WILLNEED)
:   65074 python   0.027455 RET   madvise 0
:   65074 python   0.58 CALL  madvise(0x287c5000,0x1c20,_MADV_WILLNEED)
:   65074 python   0.016904 RET   madvise 0
:   65074 python   0.000179 CALL  madvise(0x287c6000,0x1950,_MADV_WILLNEED)
:   65074 python   0.008629 RET   madvise 0
:   65074 python   0.40 CALL  madvise(0x287c8000,0x8,_MADV_WILLNEED)
:   65074 python   0.004173 RET   madvise 0
:...
:   65074 python   0.006084 CALL  msync(0x287c5000,0x453b7618,MS_ASYNC)
:   65074 python   0.106284 RET   msync 0
:...
:As you can see, it's quite a bit faster.
:
:I know that msync is necessary under Linux but obsolete under FreeBSD, but
:it's still funny that it takes a tenth of a second to return even with
:MS_ASYNC specified.
:
:Also, why is it that the madvise() calls take so much longer when the
:program does a couple of its own madvise() calls?  Was madvise() never
:intended to be run so frequently and is therefore a little slower than
:it could be?
:
:Here's the diff between the code for the first kdump above and the
:second one.

 Those times are way way too large, even with other running threads
 in the system.  madvise() should not take that long unless it is
 being forced to wait on a busied page, and neither should msync().
 madvise() doesn't even do any I/O (or shouldn't anyhow).

 Try removing just the msync() but keep the madvise() calls and see
 if the madvise() calls continue to take horrendous amounts of time.
 Then try the vise-versa.

 It kinda feels like a prior msync() is initiating physical I/O on
 pages and a later mmap/madvise or page fault is being forced to
 wait on the prior pages for the I/O to finish.

 The size_t argument to msync() (0x453b7618) is highly questionable.
 It could be ktrace reporting the wrong value, but maybe not.
 On any sort of random writing test, particularly if multiple threads
 are involved, specifying a size that large could result in very large
 latencies.

-Matt

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Performance of madvise / msync

2008-06-26 Thread Marcus Reid
On Thu, Jun 26, 2008 at 05:48:13PM -0700, Matthew Dillon wrote:
 :   65074 python   0.06 CALL  madvise(0x287c5000,0x70,_MADV_WILLNEED)
 :   65074 python   0.027455 RET   madvise 0
 :   65074 python   0.58 CALL  madvise(0x287c5000,0x1c20,_MADV_WILLNEED)
 :   65074 python   0.016904 RET   madvise 0
 :   65074 python   0.000179 CALL  madvise(0x287c6000,0x1950,_MADV_WILLNEED)
 :   65074 python   0.008629 RET   madvise 0
 :   65074 python   0.40 CALL  madvise(0x287c8000,0x8,_MADV_WILLNEED)
 :   65074 python   0.004173 RET   madvise 0
 :...
 :   65074 python   0.006084 CALL  msync(0x287c5000,0x453b7618,MS_ASYNC)
 :   65074 python   0.106284 RET   msync 0
 :...
 :As you can see, it's quite a bit faster.
 :
 :I know that msync is necessary under Linux but obsolete under FreeBSD, but
 :it's still funny that it takes a tenth of a second to return even with
 :MS_ASYNC specified.
 :
 :Also, why is it that the madvise() calls take so much longer when the
 :program does a couple of its own madvise() calls?  Was madvise() never
 :intended to be run so frequently and is therefore a little slower than
 :it could be?
 :
 :Here's the diff between the code for the first kdump above and the
 :second one.
 
  Those times are way way too large, even with other running threads
  in the system.  madvise() should not take that long unless it is
  being forced to wait on a busied page, and neither should msync().
  madvise() doesn't even do any I/O (or shouldn't anyhow).
 
  Try removing just the msync() but keep the madvise() calls and see
  if the madvise() calls continue to take horrendous amounts of time.
  Then try the vise-versa.

Ok, first off, I'm noticing that of the 4 other files that this
is doing the same operations on, sized 569, 940, 116 and 116mB,
all of the msync() and madvise() calls are nice and fast.  It's
only with the 1161524760 byte file that msync is much, much
slower.  It's not linear -- it hits a wall somewhere between
940 and 1161 million bytes.

With madvise() and without msync(), there are high numbers of
faults, which matches the number of disk io operations.  It
goes through cycles, every once in a while stalling while about
60MB of data is dumped to disk at 20MB/s or so (buffers flushing?)
At the beginning of each cycle it's fast, with 140 faults/s or so,
and slows as the number of faults climbs to 180/s or so before
stalling and flusing again.  It never gets _really_ slow though.

   36286 python   0.16 NAMI  rg2.rrd
   36286 python   0.25 RET   open 7
   36286 python   0.09 CALL  fstat(0x7,0xbfbfe428)
   36286 python   0.14 RET   fstat 0
   36286 python   0.10 CALL  
mmap(0,0x453b7618,PROT_READ|PROT_WRITE,MAP_SHARED,0x7,0,0)
   36286 python   0.20 RET   mmap 679235584/0x287c5000
   36286 python   0.10 CALL  madvise(0x287c5000,0x453b7618,_MADV_RANDOM)
   36286 python   0.10 RET   madvise 0
   36286 python   0.09 CALL  madvise(0x287c5000,0x70,_MADV_WILLNEED)
   36286 python   0.67 RET   madvise 0
   36286 python   0.16 CALL  madvise(0x287c5000,0x1c20,_MADV_WILLNEED)
   36286 python   0.15 RET   madvise 0
   36286 python   0.19 CALL  madvise(0x287c6000,0x1950,_MADV_WILLNEED)
   36286 python   0.13 RET   madvise 0
   36286 python   0.10 CALL  madvise(0x287c8000,0x8,_MADV_WILLNEED)
   36286 python   0.10 RET   madvise 0
   36286 python   0.12 CALL  gettimeofday(0xbfbfe554,0)
   36286 python   0.10 RET   gettimeofday 0
   36286 python   0.14 CALL  fcntl(0x7,invalid=12,0xbfbfe478)
   36286 python   0.21 RET   fcntl 0
   36286 python   0.040061 CALL  munmap(0x287c5000,0x453b7618)
   36286 python   0.000324 RET   munmap 0
   36286 python   0.16 CALL  close(0x7)
   36286 python   0.44 RET   close 0
   36286 python   0.000113 CALL  
__sysctl(0xbfbfe388,0x2,0xbfbfe394,0xbfbfe398,0,0)
   36286 python   0.18 RET   __sysctl 0

With msync() and without madvise(), things are very slow, and
there are no faults, just writes.

   61609 python   0.16 NAMI  rg2.rrd
   61609 python   0.24 RET   open 7
   61609 python   0.10 CALL  fstat(0x7,0xbfbfe428)
   61609 python   0.13 RET   fstat 0
   61609 python   0.10 CALL  
mmap(0,0x453b7618,PROT_READ|PROT_WRITE,MAP_SHARED,0x7,0,0)
   61609 python   0.21 RET   mmap 679235584/0x287c5000
   61609 python   0.066603 CALL  madvise(0x287c5000,0x1c20,_MADV_WILLNEED)
   61609 python   0.57 RET   madvise 0
   61609 python   0.09 CALL  madvise(0x287c6000,0x1950,_MADV_WILLNEED)
   61609 python   0.10 RET   madvise 0
   61609 python   0.09 CALL  madvise(0x287c8000,0x8,_MADV_WILLNEED)
   61609 python   0.09 RET   madvise 0
   61609 python   0.10 CALL  gettimeofday(0xbfbfe554,0)
   61609 python   0.18 RET   gettimeofday 0
   61609 python   0.14 CALL  fcntl(0x7,invalid=12,0xbfbfe478)
   61609 python   0.26 RET   fcntl 0
   61609 python   0.004044 CALL  msync(0x287c5000,0x453b7618,MS_ASYNC)
   61609 python