On Thu, 6 Aug 2015, Haomai Wang wrote:
> Agree
> 
> On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy <somnath....@sandisk.com> wrote:
> > Thanks, Sage, for digging into this; I was suspecting something similar. As I
> > mentioned in today's call, syncfs is taking ~60 ms even at idle. I have
> > 64 GB of RAM in the system.
> > The workaround I was talking about today is working pretty well so far. In
> > this implementation, I am not giving much work to syncfs, since each worker
> > thread writes in O_DSYNC mode. I issue syncfs before trimming the
> > journal, and most of the time it takes < 100 ms.
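
A minimal sketch of the pattern Somnath describes above, for illustration only:
per-worker writes opened with O_DSYNC, and a single syncfs on the data
filesystem right before trimming the journal. The paths and the trim_journal()
call are hypothetical, not the actual FileStore code.

// Sketch only: O_DSYNC writes per worker, syncfs just before journal trim.
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstring>
#include <string>
#include <stdexcept>

// Each worker writes its object with O_DSYNC, so the data (and the metadata
// needed to retrieve it) is on stable storage when the write returns.
void write_object_dsync(const char* path, const void* buf, size_t len)
{
  int fd = ::open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
  if (fd < 0)
    throw std::runtime_error(std::string("open: ") + strerror(errno));
  ssize_t r = ::pwrite(fd, buf, len, 0);
  ::close(fd);
  if (r < 0 || (size_t)r != len)
    throw std::runtime_error("short or failed write");
}

// Before trimming the journal, one syncfs() on the data filesystem catches
// anything not covered by the O_DSYNC writes (e.g. directory metadata).
void sync_before_trim(int any_fd_on_data_fs)
{
  if (::syncfs(any_fd_on_data_fs) < 0)      // glibc >= 2.14
    throw std::runtime_error("syncfs failed");
  // trim_journal();  // hypothetical: now safe to drop journal entries
}
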
> 
> Actually, I would prefer that we stop using syncfs altogether. I would rather
> use "aio + dio + a FileStore custom cache" to handle everything that
> "syncfs + page cache" does today. That way we can even make the cache smarter,
> aware of the upper layers, instead of relying on fadvise* calls. Second, we
> could use a "checkpoint" method like MySQL InnoDB: we know the bandwidth of
> the frontend (FileJournal) and can decide how much and how often we want to
> flush (using aio + dio).
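
A minimal sketch of the aio + dio building block mentioned here, using libaio
with O_DIRECT. The file name, buffer size, and alignment are illustrative
assumptions; the custom cache and checkpoint pacing logic are not shown.

// Sketch only: one O_DIRECT write submitted through libaio (link with -laio).
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>
#include <stdexcept>

int main()
{
  const size_t len = 4096;                     // must be a multiple of the
  void* buf;                                   // logical block size for dio
  if (posix_memalign(&buf, 4096, len) != 0)
    throw std::runtime_error("posix_memalign");
  memset(buf, 'x', len);

  int fd = ::open("/tmp/aio_dio_demo", O_WRONLY | O_CREAT | O_DIRECT, 0644);
  if (fd < 0)
    throw std::runtime_error("open");

  io_context_t ctx = 0;
  if (io_setup(32, &ctx) < 0)                  // allow 32 in-flight ops
    throw std::runtime_error("io_setup");

  struct iocb cb;
  struct iocb* cbs[1] = { &cb };
  io_prep_pwrite(&cb, fd, buf, len, 0);        // async write at offset 0
  if (io_submit(ctx, 1, cbs) != 1)
    throw std::runtime_error("io_submit");

  struct io_event ev;
  io_getevents(ctx, 1, 1, &ev, nullptr);       // wait for completion
  // ev.res holds the result (bytes written, or a negative errno)

  io_destroy(ctx);
  ::close(fd);
  free(buf);
  return 0;
}
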
> 
> Anyway, because it's a big project, we may prefer to do that work in newstore
> instead of filestore.
> 
> > I now have to wake up the sync_thread after each worker thread finishes
> > writing. I will benchmark both approaches. As we discussed earlier, in the
> > case of the fsync-only approach, we still need to do a db sync to make sure
> > the leveldb state is persisted, right?
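
For reference, a minimal sketch of what a synchronous leveldb write looks like
with the plain leveldb API (illustrative keys and path; not the FileStore omap
wrapper code):

// Sketch only: a leveldb write with options.sync = true, so the write is
// flushed to stable storage before Write() returns.
#include <cassert>
#include <leveldb/db.h>
#include <leveldb/write_batch.h>

int main()
{
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/demo_omap", &db);
  assert(s.ok());

  leveldb::WriteBatch batch;
  batch.Put("example_key", "example_value");   // illustrative keys only

  leveldb::WriteOptions wo;
  wo.sync = true;                              // durable before returning
  s = db->Write(wo, &batch);
  assert(s.ok());

  delete db;
  return 0;
}
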
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sw...@redhat.com]
> > Sent: Wednesday, August 05, 2015 2:27 PM
> > To: Somnath Roy
> > Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
> > Subject: FileStore should not use syncfs(2)
> >
> > Today I learned that syncfs(2) does an O(n) scan of the superblock's
> > inode list searching for dirty items.  I had always assumed it only
> > traversed dirty inodes (i.e., a dedicated dirty-inode list), but that
> > appears not to be the case, even on the latest kernels.
> >
> > That means that the more RAM in the box, the larger (generally) the inode 
> > cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
> > it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
> > servicing a very light workload, and each syncfs(2) call was taking ~7 
> > seconds (usually to write out a single inode).
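
For anyone who wants to reproduce the measurement, a minimal sketch that times
a single syncfs(2) call on the filesystem holding a given path (the default
path below is just an example):

// Sketch only: time one syncfs(2) call on the filesystem holding the given
// path, to see how long the superblock inode-list walk takes on a busy box.
#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

int main(int argc, char** argv)
{
  const char* path = argc > 1 ? argv[1] : "/var/lib/ceph";  // example path
  int fd = ::open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  auto t0 = std::chrono::steady_clock::now();
  if (::syncfs(fd) < 0) { perror("syncfs"); return 1; }     // glibc >= 2.14
  auto t1 = std::chrono::steady_clock::now();

  std::printf("syncfs took %.1f ms\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count());
  ::close(fd);
  return 0;
}
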
> >
> > A possible workaround for such boxes is to turn 
> > /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
> > pages instead of inodes/dentries)...
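
Normally you would just run sysctl -w vm.vfs_cache_pressure=<value>; purely
for illustration, a programmatic equivalent with 200 as an arbitrary example
value:

// Sketch only: raise vm.vfs_cache_pressure so the kernel reclaims
// inodes/dentries more aggressively.  200 is an arbitrary example value.
#include <fstream>

int main()
{
  std::ofstream f("/proc/sys/vm/vfs_cache_pressure");
  f << 200 << std::endl;          // needs root
  return f.good() ? 0 : 1;
}
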
> >
> > I think the take-away though is that we do need to bite the bullet and make 
> > FileStore f[data]sync all the right things so that the syncfs call can be 
> > avoided.  This is the path you were originally headed down, Somnath, and I 
> > think it's the right one.
> >
> > The main thing to watch out for is that according to POSIX you really need 
> > to fsync directories.  With XFS that isn't the case since all metadata 
> > operations are going into the journal and that's fully ordered, but we 
> > don't want to allow data loss on e.g. ext4 (we need to check what the 
> > metadata ordering behavior is there) or other file systems.
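
A minimal sketch of what "fsync the directory" means in practice, with
illustrative paths; on XFS the directory fsync is redundant because of the
ordered metadata journal, but POSIX does not promise that elsewhere:

// Sketch only: create a file, fsync the file, then fsync its parent
// directory so the new directory entry itself is durable.
#include <fcntl.h>
#include <unistd.h>

// Paths are illustrative.
bool create_object_durably(const char* dir, const char* file)
{
  int dfd = ::open(dir, O_RDONLY | O_DIRECTORY);
  if (dfd < 0) return false;

  int fd = ::openat(dfd, file, O_WRONLY | O_CREAT, 0644);
  if (fd < 0) { ::close(dfd); return false; }

  bool ok = ::fsync(fd) == 0      // file data + inode
         && ::fsync(dfd) == 0;    // the directory entry pointing at it
  ::close(fd);
  ::close(dfd);
  return ok;
}
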
> 
> I guess there are only a few directory-modifying operations, is that true?
> Maybe we only need to do syncfs when modifying directories?

I'd say there are a few broad cases:

 - creating or deleting objects.  Simply fsyncing the file is
sufficient on XFS; we should confirm what the behavior is on other
file systems.  But even if we do the fsync on the dir, this is simple to
implement.

 - renaming objects (collection_move_rename).  Easy to add an fsync here
(see the sketch after this list).

 - HashIndex rehashing.  This is where I get nervous... setting some flag
that triggers a full syncfs might be an interim solution, since it's a
pretty rare event.  OTOH, adding the fsync calls in the HashIndex code
probably isn't so bad to audit and get right either...
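
For the rename case above, a minimal sketch of what the extra fsyncs would
look like (illustrative helper, not the actual collection_move_rename code):

// Sketch only: rename an object between collections and fsync both parent
// directories so the rename itself survives a crash on filesystems that do
// not give XFS-style ordered metadata journaling.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <string>

static bool fsync_dir(const char* path)
{
  int dfd = ::open(path, O_RDONLY | O_DIRECTORY);
  if (dfd < 0) return false;
  bool ok = ::fsync(dfd) == 0;
  ::close(dfd);
  return ok;
}

bool move_rename_durably(const char* src_dir, const char* src_name,
                         const char* dst_dir, const char* dst_name)
{
  std::string src = std::string(src_dir) + "/" + src_name;
  std::string dst = std::string(dst_dir) + "/" + dst_name;
  if (std::rename(src.c_str(), dst.c_str()) != 0)
    return false;
  // New entry in the destination dir, removed entry in the source dir:
  // both directories need an fsync for the rename itself to be durable.
  return fsync_dir(dst_dir) && fsync_dir(src_dir);
}
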

sage