On 12/22/2011 03:48, Kostik Belousov wrote:
On Wed, Dec 21, 2011 at 09:03:02PM +0400, Andrey Zonov wrote:
On 15.12.2011 17:01, Kostik Belousov wrote:
On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
<[email protected]>wrote:

On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
On 14.12.2011 22:22, Jeremy Chadwick wrote:
On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
Hi Jeremy,

This is not hardware problem, I've already checked that. I also ran
fsck today and got no errors.

After some more exploration of how mongodb works, I found that when
listing hangs, one of the mongodb threads is in the "biowr" state for a
long time.  It periodically calls msync(MS_SYNC), according to the
ktrace output.

If I remove the msync() calls from mongodb, how often will the data be
synced by the OS?

--
Andrey Zonov

On 14.12.2011 2:15, Jeremy Chadwick wrote:
On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
Do you have any ideas what is going on, or how to catch the problem?
Assuming this isn't a file on the root filesystem, try booting the
machine in single-user mode and using "fsck -f" on the filesystem in
question.

Can you verify there are no problems with the disk this file lives on
as well (smartctl -a /dev/disk)?  I'm doubting this is the problem, but
thought I'd mention it.
I have no real answer, I'm sorry.  msync(2) indicates it's effectively
deprecated (see BUGS).  It looks like it is an mmap-specific version
of fsync(2).
I replaced msync(2) with fsync(2).  Unfortunately, it is not obvious
from the man pages that I can do this.  Anyway, thanks.
Sorry, that wasn't what I was implying.  Let me try to explain
differently.

msync(2) looks, to me, like an mmap-specific version of fsync(2).  Based
on the man page, it seems that with msync() you can guarantee flushing
of certain pages within an mmap()'d region to disk, while fsync() would
cause **all** buffers/internal pages to be flushed to disk.

One would need to look at the code to mongodb to find out what it's
actually doing with msync().  That is to say, if it's doing something
like this (I probably have the semantics wrong -- I've never spent much
time with mmap()):

fd = open("/some/file", O_RDWR);
ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
ret = msync(ptr, 65536, MS_SYNC);
/* or alternatively, with a zero length (which historically meant
   "the whole region", though POSIX leaves a zero length unspecified):
ret = msync(ptr, 0, MS_SYNC);
*/

Then this, to me, would be mostly the equivalent of:

fd = open("/some/file", O_RDWR);
ret = fsync(fd);

Otherwise, if it's calling msync() only on an address/location within
the region ptr points to, then that may be more efficient (fewer pages
to flush).

They call msync() for the whole file.  So, there will not be any
difference.


The mmap() arguments -- specifically flags (see man page) -- also play
a role here.  The one that catches my attention is MAP_NOSYNC.  So you
may need to look at the mongodb code to figure out what its mmap()
call is.

One might wonder why they don't just use open() with O_SYNC.  I
imagine that has to do with, again, performance; possibly they don't
want all I/O synchronous, and would rather flush certain pages in the
mmap'd region to disk as needed.  I see the legitimacy in that approach
(vs. just using O_SYNC).

There's really no easy way for me to tell you which is more efficient,
better, blah blah without spending a lot of time with a benchmarking
program that tests all of this, *plus* an entire system (world) built
with profiling.

I ran mongodb with fsync() for two hours and got the following:
STARTED                      INBLK OUBLK MAJFLT MINFLT
Thu Dec 15 10:34:52 2011         3 192744    314 3080182

This is the output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.

Then I ran it with default msync():
STARTED                      INBLK OUBLK MAJFLT MINFLT
Thu Dec 15 12:34:53 2011         0 7241555     79 5401945

There are also two graphs of disk busyness [1] [2].

The difference is significant: 37 times!  That is what I expected to get.

In the comments for vm_object_page_clean() I found this:

 *	When stuffing pages asynchronously, allow clustering.  XXX we need a
 *	synchronous clustering mode implementation.

It means to me that msync(MS_SYNC) flushes every page to disk in a
single I/O transaction each.  If we multiply 4K by 37 we get ~148K.
This number is the size of a single clustered transaction in my
experiments.

+alc@, kib@

Am I right? Is there any plan to implement this?
The current buffer clustering code can only do async writes.  In fact, I
am not quite sure what would constitute sync clustering, because the
ability to delay the write is important to be able to cluster at all.

Also, I am not sure that the lack of clustering is the biggest problem.
IMO, the fact that each write is sync is the first problem there.  It
would be quite some work to add tracking of the issued writes to
vm_object_page_clean() and down the stack, especially due to the custom
page-write VOPs in several filesystems.

The only guarantee that POSIX requires from msync(MS_SYNC) is that
the writes are finished when the syscall returns, not that the
writes are done synchronously.  Below is a hack which should help if
the msync()ed region contains the mapping of the whole file, since it
is then possible to fsync() the file after all the writes are scheduled
asynchronously.  It will cause an unneeded metadata update, but I think
it would still be much faster.


diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 250b769..a9de554 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
        vm_object_t backing_object;
        struct vnode *vp;
        struct mount *mp;
-       int flags;
+       int flags, fsync_after;

        if (object == NULL)
                return;
@@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
                (void) vn_start_write(vp, &mp, V_WAIT);
                vfslocked = VFS_LOCK_GIANT(vp->v_mount);
                vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
-               flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
-               flags |= invalidate ? OBJPC_INVAL : 0;
+               if (syncio && !invalidate && offset == 0 &&
+                   OFF_TO_IDX(size) == object->size) {
+                       /*
+                        * If syncing the whole mapping of the file,
+                        * it is faster to schedule all the writes in
+                        * async mode, also allowing the clustering,
+                        * and then wait for i/o to complete.
+                        */
+                       flags = 0;
+                       fsync_after = TRUE;
+               } else {
+                       flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
+                       flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
+                       fsync_after = FALSE;
+               }
                VM_OBJECT_LOCK(object);
                vm_object_page_clean(object, offset, offset + size, flags);
                VM_OBJECT_UNLOCK(object);
+               if (fsync_after)
+                       (void) VOP_FSYNC(vp, MNT_WAIT, curthread);
                VOP_UNLOCK(vp, 0);
                VFS_UNLOCK_GIANT(vfslocked);
                vn_finished_write(mp);
Thanks, this patch works.  Performance is the same as with fsync().

Actually, Linux uses fsync() inside of msync() if MS_SYNC is set.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/msync.c;h=632df4527c0122062d9332a0d483835274ed62f6;hb=HEAD

I see, indeed Linux fully fsyncs the whole file if even a single page of
it appears to be (non-shadowed) mmapped into the msync(MS_SYNC) region.
I am not sure that we should follow this behaviour.

Alan, do you agree with the patch above ?

Yes, it's ok.

Alan

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[email protected]"
