On Wed, Dec 21, 2011 at 09:03:02PM +0400, Andrey Zonov wrote:
> On 15.12.2011 17:01, Kostik Belousov wrote:
> >On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
> >>On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
> >><[email protected]>wrote:
> >>
> >>>On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
> >>>>On 14.12.2011 22:22, Jeremy Chadwick wrote:
> >>>>>On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
> >>>>>>Hi Jeremy,
> >>>>>>
> >>>>>>This is not a hardware problem, I've already checked that. I also
> >>>>>>ran fsck today and got no errors.
> >>>>>>
> >>>>>>After some more exploration of how mongodb works, I found that when
> >>>>>>listing hangs, one of the mongodb threads is in the "biowr" state
> >>>>>>for a long time. It periodically calls msync(MS_SYNC), according to
> >>>>>>the ktrace output.
> >>>>>>
> >>>>>>If I remove the msync() calls from mongodb, how often will the data
> >>>>>>be synced by the OS?
> >>>>>>
> >>>>>>--
> >>>>>>Andrey Zonov
> >>>>>>
> >>>>>>On 14.12.2011 2:15, Jeremy Chadwick wrote:
> >>>>>>>On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
> >>>>>>>>
> >>>>>>>>Do you have any ideas what is going on, or how to catch the
> >>>>>>>>problem?
> >>>>>>>
> >>>>>>>Assuming this isn't a file on the root filesystem, try booting the
> >>>>>>>machine in single-user mode and using "fsck -f" on the filesystem
> >>>>>>>in question.
> >>>>>>>
> >>>>>>>Can you verify there are no problems with the disk this file lives
> >>>>>>>on as well (smartctl -a /dev/disk)? I'm doubting this is the
> >>>>>>>problem, but thought I'd mention it.
> >>>>>
> >>>>>I have no real answer, I'm sorry. msync(2) indicates it's effectively
> >>>>>deprecated (see BUGS). It looks like this is effectively an
> >>>>>mmap-version of fsync(2).
> >>>>
> >>>>I replaced msync(2) with fsync(2). Unfortunately, from the man pages
> >>>>it is not obvious that I can do this. Anyway, thanks.
> >>>
> >>>Sorry, that wasn't what I was implying.
> >>>Let me try to explain differently.
> >>>
> >>>msync(2) looks, to me, like an mmap-specific version of fsync(2).
> >>>Based on the man page, it seems that with msync() you can effectively
> >>>guarantee flushing of certain pages within an mmap()'d region to disk.
> >>>fsync() would cause **all** buffers/internal pages to be flushed to
> >>>disk.
> >>>
> >>>One would need to look at the mongodb code to find out what it's
> >>>actually doing with msync(). That is to say, if it's doing something
> >>>like this (I probably have the semantics wrong -- I've never spent
> >>>much time with mmap()):
> >>>
> >>>fd = open("/some/file", O_RDWR);
> >>>ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> >>>ret = msync(ptr, 65536, MS_SYNC);
> >>>
> >>>Then this, to me, would be mostly equivalent to:
> >>>
> >>>fp = fopen("/some/file", "r+");
> >>>ret = fsync(fileno(fp));
> >>>
> >>>Otherwise, if it's calling msync() only on an address/location within
> >>>the region ptr points to, then that may be more efficient (fewer pages
> >>>to flush).
> >>>
> >>
> >>They call msync() for the whole file. So, there will not be any
> >>difference.
> >>
> >>
> >>>The mmap() arguments -- specifically flags (see the man page) -- also
> >>>play a role here. The one that catches my attention is MAP_NOSYNC. So
> >>>you may need to look at the mongodb code to figure out what its mmap()
> >>>call is.
> >>>
> >>>One might wonder why they don't just use open() with O_SYNC. I imagine
> >>>that has to do with, again, performance; possibly they don't want all
> >>>I/O synchronous, and would rather flush certain pages in the mmap'd
> >>>region to disk as needed. I see the legitimacy in that approach (vs.
> >>>just using O_SYNC).
> >>>
> >>>There's really no easy way for me to tell you which is more efficient,
> >>>better, blah blah without spending a lot of time with a benchmarking
> >>>program that tests all of this, *plus* an entire system (world) built
> >>>with profiling.
> >>>
> >>
> >>I ran mongodb for two hours with fsync() and got the following:
> >>STARTED                     INBLK    OUBLK  MAJFLT   MINFLT
> >>Thu Dec 15 10:34:52 2011        3   192744     314  3080182
> >>
> >>This is the output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.
> >>
> >>Then I ran it with the default msync():
> >>STARTED                     INBLK    OUBLK  MAJFLT   MINFLT
> >>Thu Dec 15 12:34:53 2011        0  7241555      79  5401945
> >>
> >>There are also two graphs of disk busy [1] [2].
> >>
> >>The difference is significant: 37 times! That is what I expected to get.
> >>
> >>In the commentary for vm_object_page_clean() I found this:
> >>
> >> * When stuffing pages asynchronously, allow clustering. XXX we need a
> >> * synchronous clustering mode implementation.
> >>
> >>It means to me that msync(MS_SYNC) flushes every page to disk in its own
> >>single-page IO transaction. If we multiply 4K by 37 we get ~150K. This
> >>number is the size of a single clustered transaction in my experience.
> >>
> >>+alc@, kib@
> >>
> >>Am I right? Is there any plan to implement this?
> >
> >The current buffer clustering code can only do async writes. In fact, I
> >am not quite sure what would constitute sync clustering, because the
> >ability to delay a write is important to be able to cluster at all.
> >
> >Also, I am not sure that the lack of clustering is the biggest problem.
> >IMO, the fact that each write is sync is the first problem there. It
> >would be quite some work to add tracking of the issued writes to
> >vm_object_page_clean() and down the stack, esp. due to the custom page
> >write vops in several fses.
> >
> >The only guarantee that POSIX requires from msync(MS_SYNC) is that the
> >writes are finished when the syscall returns, not that the writes are
> >done synchronously.
> >Below is a hack which should help if the msync()ed region contains the
> >mapping of the whole file, since it is then possible to fsync() the file
> >after all writes are scheduled asynchronously. It will cause an unneeded
> >metadata update, but I think it would still be much faster.
> >
> >diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
> >index 250b769..a9de554 100644
> >--- a/sys/vm/vm_object.c
> >+++ b/sys/vm/vm_object.c
> >@@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
> > 	vm_object_t backing_object;
> > 	struct vnode *vp;
> > 	struct mount *mp;
> >-	int flags;
> >+	int flags, fsync_after;
> >
> > 	if (object == NULL)
> > 		return;
> >@@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
> > 		(void) vn_start_write(vp, &mp, V_WAIT);
> > 		vfslocked = VFS_LOCK_GIANT(vp->v_mount);
> > 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
> >-		flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> >-		flags |= invalidate ? OBJPC_INVAL : 0;
> >+		if (syncio && !invalidate && offset == 0 &&
> >+		    OFF_TO_IDX(size) == object->size) {
> >+			/*
> >+			 * If syncing the whole mapping of the file,
> >+			 * it is faster to schedule all the writes in
> >+			 * async mode, also allowing the clustering,
> >+			 * and then wait for i/o to complete.
> >+			 */
> >+			flags = 0;
> >+			fsync_after = TRUE;
> >+		} else {
> >+			flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> >+			flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
> >+			fsync_after = FALSE;
> >+		}
> > 		VM_OBJECT_LOCK(object);
> > 		vm_object_page_clean(object, offset, offset + size, flags);
> > 		VM_OBJECT_UNLOCK(object);
> >+		if (fsync_after)
> >+			(void) VOP_FSYNC(vp, MNT_WAIT, curthread);
> > 		VOP_UNLOCK(vp, 0);
> > 		VFS_UNLOCK_GIANT(vfslocked);
> > 		vn_finished_write(mp);
>
> Thanks, this patch works. Performance is the same as with fsync().
>
> Actually, Linux uses fsync() inside of msync() if MS_SYNC is set.
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/msync.c;h=632df4527c0122062d9332a0d483835274ed62f6;hb=HEAD

I see, indeed Linux fully fsyncs the whole file if even a single page of it
appears to be (non-shadowed) mmapped into the msync(MS_SYNC) region. I am
not sure that we should follow this behaviour.
Alan, do you agree with the patch above?
