Thanks, Sage, for digging into this. I was suspecting something similar. As I 
mentioned in today's call, even at idle syncfs is taking ~60 ms. I have 64 GB 
of RAM in the system.
The workaround I was talking about today is working pretty well so far. In 
this implementation I am not giving syncfs much work to do, since each worker 
thread writes in O_DSYNC mode. I issue syncfs only before trimming the 
journal, and most of the time it takes < 100 ms.
I now have to wake up the sync_thread after each worker thread finishes 
writing. I will benchmark both approaches. As we discussed earlier, in the 
fsync-only approach we would still need a db sync to make sure the leveldb 
state is persisted, right?
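For clarity, the workaround described above can be sketched roughly like this (a minimal Python sketch, not the actual FileStore C++; the directory, file names, and payload are made up, and syncfs is called through glibc since Python does not expose it):

```python
import ctypes
import os
import tempfile

# Sketch of the workaround: worker threads write with O_DSYNC so each
# write is durable on its own, and syncfs(2) is issued only once,
# right before trimming the journal. syncfs is Linux-specific, so we
# call it through glibc via ctypes.
libc = ctypes.CDLL("libc.so.6", use_errno=True)

workdir = tempfile.mkdtemp()

def worker_write(name, payload):
    # O_DSYNC: write() returns only after the data (and the metadata
    # needed to read it back) has reached stable storage.
    fd = os.open(os.path.join(workdir, name),
                 os.O_CREAT | os.O_WRONLY | os.O_DSYNC, 0o644)
    try:
        os.write(fd, payload)
    finally:
        os.close(fd)

worker_write("obj-1", b"payload")

# One syncfs before trimming the journal catches whatever O_DSYNC did
# not cover (directory entries, leveldb files, ...).
dirfd = os.open(workdir, os.O_RDONLY)
assert libc.syncfs(dirfd) == 0
os.close(dirfd)
```

Since the per-write O_DSYNC already pushed the data out, this one syncfs has very little dirty state left to flush, which is why it stays under ~100 ms.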

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, August 05, 2015 2:27 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
Subject: FileStore should not use syncfs(2)

Today I learned that syncfs(2) does an O(n) search of the superblock's inode 
list searching for dirty items.  I've always assumed that it was only 
traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to 
be the case, even on the latest kernels.

That means that the more RAM in the box, the larger (generally) the inode 
cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it.  
The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing 
a very light workload, and each syncfs(2) call was taking ~7 seconds (usually 
to write out a single inode).
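Latency numbers like these are easy to reproduce; here is a minimal sketch, assuming Linux/glibc, that times a single syncfs(2) call on the filesystem behind the current directory (any open fd on the target filesystem works):

```python
import ctypes
import os
import time

# syncfs(2) is Linux-specific and not exposed by Python's os module,
# so call it through glibc. Any fd on the target filesystem will do.
libc = ctypes.CDLL("libc.so.6", use_errno=True)

fd = os.open(".", os.O_RDONLY)
t0 = time.monotonic()
rc = libc.syncfs(fd)           # walks the superblock's inode list
elapsed_ms = (time.monotonic() - t0) * 1000
os.close(fd)

print(f"syncfs returned {rc} after {elapsed_ms:.1f} ms")
```

On a box with a large inode cache, the elapsed time grows with the cache size even when almost nothing is dirty, which is the O(n) behavior described above.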

A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure 
way up (so that the kernel favors caching pages instead of inodes/dentries)...
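For reference, the knob lives at /proc/sys/vm/vfs_cache_pressure; a small sketch for inspecting it (writing a higher value needs root, e.g. `sysctl vm.vfs_cache_pressure=200`, which is not done here):

```python
import os

# The sysctl knob mentioned above. Values above the default of 100
# make the kernel reclaim inode/dentry caches more aggressively,
# shrinking the inode list that syncfs(2) has to walk. This sketch
# only reads the current value; it returns None on non-Linux systems.
KNOB = "/proc/sys/vm/vfs_cache_pressure"

def vfs_cache_pressure():
    if not os.path.exists(KNOB):
        return None
    with open(KNOB) as f:
        return int(f.read().strip())

print(vfs_cache_pressure())
```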

I think the take-away though is that we do need to bite the bullet and make 
FileStore f[data]sync all the right things so that the syncfs call can be 
avoided.  This is the path you were originally headed down, Somnath, and I 
think it's the right one.

The main thing to watch out for is that according to POSIX you really need to 
fsync directories.  With XFS that isn't the case since all metadata operations 
are going into the journal and that's fully ordered, but we don't want to allow 
data loss on e.g. ext4 (we need to check what the metadata ordering behavior is 
there) or other file systems.
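The POSIX-safe sequence is: fsync the file itself, then fsync its parent directory so the directory entry is durable too. A minimal sketch (illustrative only, with a throwaway temp directory):

```python
import os
import tempfile

# Durably create a file the portable way: fsync the file, then fsync
# the containing directory so the new directory entry itself is on
# disk. On XFS the ordered metadata journal makes the second fsync
# redundant in practice, but ext4 and other filesystems need it.
d = tempfile.mkdtemp()
path = os.path.join(d, "obj")

fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.write(fd, b"data")
os.fsync(fd)               # persist the file's data and inode
os.close(fd)

dfd = os.open(d, os.O_RDONLY)
os.fsync(dfd)              # persist the directory entry for "obj"
os.close(dfd)
```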

:(

sage

