RE: FileStore should not use syncfs(2)

2015-08-07 Thread Chen, Xiaoxi
 FWIW, I often see performance increase when favoring inode/dentry cache, but
 probably with far fewer inodes than the setup you just saw.  It sounds like
 there needs to be some maximum limit on the inode/dentry cache to prevent this
 kind of behavior but still favor it up until that point.  Having said that,
 maybe avoiding syncfs is best as you say below.

We also see that in most cases. Usually we set this to 1 (prefer inodes) as a
tuning BKM for small-file storage.

Can we work around it by enlarging the FDCache and tuning
/proc/sys/vm/vfs_cache_pressure to 100 (prefer data)? That means we would try
to use the FDCache in place of the inode/dentry cache as much as possible.



 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
 ow...@vger.kernel.org] On Behalf Of Mark Nelson
 Sent: Thursday, August 6, 2015 5:56 AM
 To: Sage Weil; somnath@sandisk.com
 Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
 Subject: Re: FileStore should not use syncfs(2)
 
 
 
 On 08/05/2015 04:26 PM, Sage Weil wrote:
  Today I learned that syncfs(2) does an O(n) search of the superblock's
  inode list searching for dirty items.  I've always assumed that it was
  only traversing dirty inodes (e.g., a list of dirty inodes), but that
  appears not to be the case, even on the latest kernels.
 
  That means that the more RAM in the box, the larger (generally) the
  inode cache, the longer syncfs(2) will take, and the more CPU you'll
  waste doing it.  The box I was looking at had 256GB of RAM, 36 OSDs,
  and a load of ~40 servicing a very light workload, and each syncfs(2)
  call was taking ~7 seconds (usually to write out a single inode).
 
  A possible workaround for such boxes is to turn
  /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors
  caching pages instead of inodes/dentries)...
 
 FWIW, I often see performance increase when favoring inode/dentry cache, but
 probably with far fewer inodes than the setup you just saw.  It sounds like
 there needs to be some maximum limit on the inode/dentry cache to prevent this
 kind of behavior but still favor it up until that point.  Having said that,
 maybe avoiding syncfs is best as you say below.
 
 
  I think the take-away though is that we do need to bite the bullet and
  make FileStore f[data]sync all the right things so that the syncfs
  call can be avoided.  This is the path you were originally headed
  down, Somnath, and I think it's the right one.
 
  The main thing to watch out for is that according to POSIX you really
  need to fsync directories.  With XFS that isn't the case since all
  metadata operations are going into the journal and that's fully
  ordered, but we don't want to allow data loss on e.g. ext4 (we need to
  check what the metadata ordering behavior is there) or other file systems.
 
  :(
 
  sage


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Yan, Zheng
On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil sw...@redhat.com wrote:
 Today I learned that syncfs(2) does an O(n) search of the superblock's
 inode list searching for dirty items.  I've always assumed that it was
 only traversing dirty inodes (e.g., a list of dirty inodes), but that
 appears not to be the case, even on the latest kernels.


I checked the syncfs code in the 3.10/4.1 kernels. I think both kernels only
traverse dirty inodes (inodes on the bdi_writeback::{b_dirty,b_io,b_more_io}
lists). What am I missing?


 That means that the more RAM in the box, the larger (generally) the inode
 cache, the longer syncfs(2) will take, and the more CPU you'll waste doing
 it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40
 servicing a very light workload, and each syncfs(2) call was taking ~7
 seconds (usually to write out a single inode).

 A possible workaround for such boxes is to turn
 /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching
 pages instead of inodes/dentries)...

 I think the take-away though is that we do need to bite the bullet and
 make FileStore f[data]sync all the right things so that the syncfs call
 can be avoided.  This is the path you were originally headed down,
 Somnath, and I think it's the right one.

 The main thing to watch out for is that according to POSIX you really need
 to fsync directories.  With XFS that isn't the case since all metadata
 operations are going into the journal and that's fully ordered, but we
 don't want to allow data loss on e.g. ext4 (we need to check what the
 metadata ordering behavior is there) or other file systems.

 :(

 sage


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Christoph Hellwig
On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote:
 Today I learned that syncfs(2) does an O(n) search of the superblock's 
 inode list searching for dirty items.  I've always assumed that it was 
 only traversing dirty inodes (e.g., a list of dirty inodes), but that 
 appears not to be the case, even on the latest kernels.

I'm pretty sure Dave had some patches for that.  Even if they aren't
included, it's not an unsolved problem.

 The main thing to watch out for is that according to POSIX you really need 
 to fsync directories.  With XFS that isn't the case since all metadata 
 operations are going into the journal and that's fully ordered, but we 
 don't want to allow data loss on e.g. ext4 (we need to check what the 
 metadata ordering behavior is there) or other file systems.

That additional fsync in XFS is basically free, so better get it right
and let the file system micro optimize for you.



Re: FileStore should not use syncfs(2)

2015-08-06 Thread Sage Weil
On Thu, 6 Aug 2015, Haomai Wang wrote:
 Agree
 
 On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy somnath@sandisk.com wrote:
  Thanks Sage for digging down..I was suspecting something similar.. As I 
  mentioned in today's call, in idle time also syncfs is taking ~60ms. I have 
  64 GB of RAM in the system.
  The workaround I was talking about today  is working pretty good so far. In 
  this implementation, I am not giving much work to syncfs as each worker 
  thread is writing with o_dsync mode. I am issuing syncfs before trimming 
  the journal and most of the time I saw it is taking < 100 ms.
 
 Actually I would prefer we don't use syncfs anymore. I would rather use
 aio+dio plus a FileStore custom cache to handle everything syncfs+pagecache
 does. That way we can even make the cache smarter and aware of the upper
 levels, instead of relying on fadvise* calls. Second, we can use a checkpoint
 method like MySQL InnoDB: knowing the bandwidth of the frontend (FileJournal),
 we can decide how much and how often to flush (using aio+dio).
 
 Anyway, because it's a big project, we may prefer to do this work in NewStore
 instead of FileStore.
 
  I have to wake up the sync_thread now after each worker thread finished 
  writing. I will benchmark both the approaches. As we discussed earlier, in 
  case of only fsync approach, we still need to do a db sync to make sure the 
  leveldb stuff persisted, right ?
 
  Thanks & Regards
  Somnath
 
  -Original Message-
  From: Sage Weil [mailto:sw...@redhat.com]
  Sent: Wednesday, August 05, 2015 2:27 PM
  To: Somnath Roy
  Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
  Subject: FileStore should not use syncfs(2)
 
  Today I learned that syncfs(2) does an O(n) search of the superblock's 
  inode list searching for dirty items.  I've always assumed that it was only 
  traversing dirty inodes (e.g., a list of dirty inodes), but that appears 
  not to be the case, even on the latest kernels.
 
  That means that the more RAM in the box, the larger (generally) the inode 
  cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
  it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
  servicing a very light workload, and each syncfs(2) call was taking ~7 
  seconds (usually to write out a single inode).
 
  A possible workaround for such boxes is to turn 
  /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
  pages instead of inodes/dentries)...
 
  I think the take-away though is that we do need to bite the bullet and make 
  FileStore f[data]sync all the right things so that the syncfs call can be 
  avoided.  This is the path you were originally headed down, Somnath, and I 
  think it's the right one.
 
  The main thing to watch out for is that according to POSIX you really need 
  to fsync directories.  With XFS that isn't the case since all metadata 
  operations are going into the journal and that's fully ordered, but we 
  don't want to allow data loss on e.g. ext4 (we need to check what the 
  metadata ordering behavior is there) or other file systems.
 
 I guess there are only a few directory-modifying operations, is that true?
 Maybe we only need to do syncfs when modifying directories?

I'd say there are a few broad cases:

 - creating or deleting objects.  Simply fsyncing the file is
sufficient on XFS; we should confirm what the behavior is on other
distros.  But even if we do the fsync on the dir this is simple to
implement.

 - renaming objects (collection_move_rename).  Easy to add an fsync here.

 - HashIndex rehashing.  This is where I get nervous... and setting some 
flag that triggers a full syncfs might be an interim solution since it's a 
pretty rare event.  OTOH, adding the fsync calls in the HashIndex code 
probably isn't so bad to audit and get right either...

sage
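
To make the last two cases above concrete, here is a minimal sketch
(illustrative only -- not actual FileStore code; move_object, note_rehash,
commit_point and the flag are made-up names): fsync the affected directories
after a rename, and arm a flag on HashIndex rehash so the next commit falls
back to a full syncfs.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool needs_syncfs = false;  /* set by rehash, cleared at commit */

/* fsync a directory by path; this is what persists create/unlink/rename
   entries on filesystems without XFS's fully ordered metadata journal */
static int fsync_dir(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;
    int r = fsync(fd);
    close(fd);
    return r;
}

/* collection_move_rename-style case: rename, then persist the directory
   entries that changed */
int move_object(const char *src, const char *srcdir,
                const char *dst, const char *dstdir)
{
    if (rename(src, dst) < 0)
        return -1;
    return fsync_dir(dstdir) < 0 ? -1 : fsync_dir(srcdir);
}

/* HashIndex rehash: rather than auditing every splitting path right away,
   arm the interim fallback */
void note_rehash(void) { needs_syncfs = true; }

/* commit point: full syncfs only if a rehash happened since the last commit;
   otherwise the per-file and per-directory fsyncs were already enough */
int commit_point(int any_fd_on_the_fs)
{
    if (!needs_syncfs)
        return 0;
    needs_syncfs = false;
    return syncfs(any_fd_on_the_fs);
}

The volatile flag stands in for whatever real synchronization a sync thread
would use; it is only there to show the shape of the fallback.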


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Sage Weil
On Thu, 6 Aug 2015, Christoph Hellwig wrote:
 On Wed, Aug 05, 2015 at 02:26:30PM -0700, Sage Weil wrote:
  Today I learned that syncfs(2) does an O(n) search of the superblock's 
  inode list searching for dirty items.  I've always assumed that it was 
  only traversing dirty inodes (e.g., a list of dirty inodes), but that 
  appears not to be the case, even on the latest kernels.
 
 I'm pretty sure Dave had some patches for that.  Even if they aren't
 included, it's not an unsolved problem.
 
  The main thing to watch out for is that according to POSIX you really need 
  to fsync directories.  With XFS that isn't the case since all metadata 
  operations are going into the journal and that's fully ordered, but we 
  don't want to allow data loss on e.g. ext4 (we need to check what the 
  metadata ordering behavior is there) or other file systems.
 
 That additional fsync in XFS is basically free, so better get it right
 and let the file system micro optimize for you.

I'm guessing the strategy here should be to fsync the file (leaf) and then 
any affected ancestors, such that the directory fsyncs are effectively 
no-ops?  Or does it matter?

Thanks!
sage



Re: FileStore should not use syncfs(2)

2015-08-06 Thread Sage Weil
On Thu, 6 Aug 2015, Yan, Zheng wrote:
 On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil sw...@redhat.com wrote:
  Today I learned that syncfs(2) does an O(n) search of the superblock's
  inode list searching for dirty items.  I've always assumed that it was
  only traversing dirty inodes (e.g., a list of dirty inodes), but that
  appears not to be the case, even on the latest kernels.
 
 
 I checked syncfs code in 3.10/4.1 kernel. I think both kernels only
 traverse dirty inodes (inodes in
 bdi_writeback::{b_dirty,b_io,b_more_io} lists). what am I missing?

See wait_sb_inodes in fs/fs-writeback.c, called by sync_inodes_sb.

sage
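
For reference, a heavily simplified paraphrase of that function as it looked
in kernels of that era (written from memory, with the locking, igrab/iput and
list-restart details omitted -- consult the real fs/fs-writeback.c). The
relevant point is that the loop walks sb->s_inodes, the superblock's list of
all in-core inodes, not just the dirty ones:

/* Paraphrase of wait_sb_inodes() (fs/fs-writeback.c, ~3.10/4.1), called by
   sync_inodes_sb() after dirty writeback has been queued.  Not compilable
   as-is; it only shows the shape of the loop. */
static void wait_sb_inodes(struct super_block *sb)
{
        struct inode *inode;

        /* s_inodes holds every cached inode of this superblock, so the walk
           is O(cached inodes), not O(dirty inodes) */
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                struct address_space *mapping = inode->i_mapping;

                if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW))
                        continue;       /* inode being set up or torn down */
                if (mapping->nrpages == 0)
                        continue;       /* no page cache, nothing to wait on */

                filemap_fdatawait(mapping);  /* wait for writeback to finish */
        }
}

So even if writeback submission only visits dirty inodes (the lists Zheng
mentions), this wait pass still scans the whole inode cache, which is
consistent with the ~7 second syncfs(2) calls on a 256GB box.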


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Christoph Hellwig
On Thu, Aug 06, 2015 at 06:00:42AM -0700, Sage Weil wrote:
 I'm guessing the strategy here should be to fsync the file (leaf) and then 
 any affected ancestors, such that the directory fsyncs are effectively 
 no-ops?  Or does it matter?

All metadata transactions log the involved parties (parent and child
inode(s) mostly) in the same transaction.  So flushing one of them out
is enough.  But file data I/O might dirty the inode again before it gets
flushed out, so to avoid writing the inode log item twice you first want
to fsync any file that had data I/O, followed by directories or special
files that only had metadata modified.
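
A minimal sketch of that ordering for a single new object (illustrative only;
the paths and function name are made up, not FileStore code): flush the
data-dirty file first, then the directory that only had metadata modified.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Persist one newly written object in the order described above:
   1) fdatasync the file that took data I/O, 2) then fsync its parent
   directory, which only had metadata modified. */
int persist_object(const char *dirpath, const char *filepath,
                   const void *buf, size_t len)
{
    int fd = open(filepath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, buf, len) != (ssize_t)len || fdatasync(fd) < 0)
        return -1;
    close(fd);

    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);  /* basically free on XFS */
    if (dfd < 0 || fsync(dfd) < 0)
        return -1;
    close(dfd);
    return 0;
}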


FileStore should not use syncfs(2)

2015-08-05 Thread Sage Weil
Today I learned that syncfs(2) does an O(n) search of the superblock's 
inode list searching for dirty items.  I've always assumed that it was 
only traversing dirty inodes (e.g., a list of dirty inodes), but that 
appears not to be the case, even on the latest kernels.

That means that the more RAM in the box, the larger (generally) the inode 
cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
servicing a very light workload, and each syncfs(2) call was taking ~7 
seconds (usually to write out a single inode).

A possible workaround for such boxes is to turn 
/proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
pages instead of inodes/dentries)...
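
For reference, vfs_cache_pressure is just a procfs/sysctl knob; a minimal way
to raise it is below (needs root; the value 200 is arbitrary, not a tested
recommendation -- 100 is the default, and higher values make the kernel
reclaim dentries/inodes more aggressively relative to page cache).

#include <stdio.h>

/* Equivalent to `sysctl vm.vfs_cache_pressure=200`. */
int raise_vfs_cache_pressure(void)
{
    FILE *f = fopen("/proc/sys/vm/vfs_cache_pressure", "w");
    if (!f)
        return -1;
    int r = (fprintf(f, "200\n") > 0) ? 0 : -1;
    fclose(f);
    return r;
}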

I think the take-away though is that we do need to bite the bullet and 
make FileStore f[data]sync all the right things so that the syncfs call 
can be avoided.  This is the path you were originally headed down, 
Somnath, and I think it's the right one.

The main thing to watch out for is that according to POSIX you really need 
to fsync directories.  With XFS that isn't the case since all metadata 
operations are going into the journal and that's fully ordered, but we 
don't want to allow data loss on e.g. ext4 (we need to check what the 
metadata ordering behavior is there) or other file systems.

:(

sage


RE: FileStore should not use syncfs(2)

2015-08-05 Thread Somnath Roy
Thanks Sage for digging down.. I was suspecting something similar. As I
mentioned in today's call, even at idle syncfs is taking ~60ms, and I have 64
GB of RAM in the system.
The workaround I was talking about today is working pretty well so far. In
this implementation I am not giving much work to syncfs, as each worker thread
is writing in O_DSYNC mode. I am issuing syncfs before trimming the journal,
and most of the time I saw it taking < 100 ms.
I now have to wake up the sync_thread after each worker thread finishes
writing. I will benchmark both approaches. As we discussed earlier, with the
fsync-only approach we still need to do a db sync to make sure the leveldb
state is persisted, right?

Thanks & Regards
Somnath
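
A rough sketch of the shape of this workaround (illustrative only -- paths,
function names and the trim hook are not from the actual patch): each worker
writes object data with O_DSYNC so the write is durable when it returns, and a
single syncfs() is issued as a catch-all right before the FileJournal is
trimmed, where it should find very little left to flush.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* worker thread: O_DSYNC makes each write durable on return */
ssize_t worker_write(const char *path, const void *buf, size_t len, off_t off)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t r = pwrite(fd, buf, len, off);
    close(fd);
    return r;
}

/* sync thread: catch-all barrier before trimming the journal; the leveldb
   metadata still needs its own sync (or sync writes) to be safe */
int pre_trim_barrier(const char *osd_current_dir)
{
    int fd = open(osd_current_dir, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;
    int r = syncfs(fd);
    close(fd);
    return r;
}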

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Wednesday, August 05, 2015 2:27 PM
To: Somnath Roy
Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
Subject: FileStore should not use syncfs(2)

Today I learned that syncfs(2) does an O(n) search of the superblock's inode 
list searching for dirty items.  I've always assumed that it was only 
traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to 
be the case, even on the latest kernels.

That means that the more RAM in the box, the larger (generally) the inode 
cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it.  
The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing 
a very light workload, and each syncfs(2) call was taking ~7 seconds (usually 
to write out a single inode).

A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure 
way up (so that the kernel favors caching pages instead of inodes/dentries)...

I think the take-away though is that we do need to bite the bullet and make 
FileStore f[data]sync all the right things so that the syncfs call can be 
avoided.  This is the path you were originally headed down, Somnath, and I 
think it's the right one.

The main thing to watch out for is that according to POSIX you really need to 
fsync directories.  With XFS that isn't the case since all metadata operations 
are going into the journal and that's fully ordered, but we don't want to allow 
data loss on e.g. ext4 (we need to check what the metadata ordering behavior is 
there) or other file systems.

:(

sage





Re: FileStore should not use syncfs(2)

2015-08-05 Thread Mark Nelson



On 08/05/2015 04:26 PM, Sage Weil wrote:

Today I learned that syncfs(2) does an O(n) search of the superblock's
inode list searching for dirty items.  I've always assumed that it was
only traversing dirty inodes (e.g., a list of dirty inodes), but that
appears not to be the case, even on the latest kernels.

That means that the more RAM in the box, the larger (generally) the inode
cache, the longer syncfs(2) will take, and the more CPU you'll waste doing
it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40
servicing a very light workload, and each syncfs(2) call was taking ~7
seconds (usually to write out a single inode).

A possible workaround for such boxes is to turn
/proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching
pages instead of inodes/dentries)...


FWIW, I often see performance increase when favoring inode/dentry cache, 
but probably with far fewer inodes than the setup you just saw.  It 
sounds like there needs to be some maximum limit on the inode/dentry 
cache to prevent this kind of behavior but still favor it up until that 
point.  Having said that, maybe avoiding syncfs is best as you say below.




I think the take-away though is that we do need to bite the bullet and
make FileStore f[data]sync all the right things so that the syncfs call
can be avoided.  This is the path you were originally headed down,
Somnath, and I think it's the right one.

The main thing to watch out for is that according to POSIX you really need
to fsync directories.  With XFS that isn't the case since all metadata
operations are going into the journal and that's fully ordered, but we
don't want to allow data loss on e.g. ext4 (we need to check what the
metadata ordering behavior is there) or other file systems.

:(

sage


Re: FileStore should not use syncfs(2)

2015-08-05 Thread Haomai Wang
Agree

On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy somnath@sandisk.com wrote:
 Thanks Sage for digging down..I was suspecting something similar.. As I 
 mentioned in today's call, in idle time also syncfs is taking ~60ms. I have 
 64 GB of RAM in the system.
 The workaround I was talking about today  is working pretty good so far. In 
 this implementation, I am not giving much work to syncfs as each worker 
 thread is writing with o_dsync mode. I am issuing syncfs before trimming the 
 journal and most of the time I saw it is taking < 100 ms.

Actually I would prefer we don't use syncfs anymore. I would rather use
aio+dio plus a FileStore custom cache to handle everything syncfs+pagecache
does. That way we can even make the cache smarter and aware of the upper
levels, instead of relying on fadvise* calls. Second, we can use a checkpoint
method like MySQL InnoDB: knowing the bandwidth of the frontend (FileJournal),
we can decide how much and how often to flush (using aio+dio).

Anyway, because it's a big project, we may prefer to do this work in NewStore
instead of FileStore.
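
As a rough illustration of the aio+dio direction only (this is just a sketch,
not NewStore or any actual design; the file name, queue depth and sizes are
made up): with O_DIRECT the page cache is bypassed entirely, so caching and
flush policy move into the store itself -- e.g. driven by a checkpoint scheme
-- instead of being left to syncfs.

/* Build with -laio.  O_DIRECT requires aligned buffers, lengths and offsets;
   4096 is a common safe alignment. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int dio_write_once(void)
{
    io_context_t ctx = 0;
    if (io_setup(128, &ctx) < 0)          /* queue depth of 128, arbitrary */
        return -1;

    int fd = open("object.data", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096)) /* aligned buffer for O_DIRECT */
        return -1;
    memset(buf, 0x5a, 4096);

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 4096, 0);
    if (io_submit(ctx, 1, cbs) != 1)
        return -1;

    struct io_event ev;
    io_getevents(ctx, 1, 1, &ev, NULL);   /* wait for the one completion */

    free(buf);
    close(fd);
    io_destroy(ctx);
    return 0;
}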

 I have to wake up the sync_thread now after each worker thread finished 
 writing. I will benchmark both the approaches. As we discussed earlier, in 
 case of only fsync approach, we still need to do a db sync to make sure the 
 leveldb stuff persisted, right ?

 Thanks & Regards
 Somnath

 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Wednesday, August 05, 2015 2:27 PM
 To: Somnath Roy
 Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
 Subject: FileStore should not use syncfs(2)

 Today I learned that syncfs(2) does an O(n) search of the superblock's inode 
 list searching for dirty items.  I've always assumed that it was only 
 traversing dirty inodes (e.g., a list of dirty inodes), but that appears not 
 to be the case, even on the latest kernels.

 That means that the more RAM in the box, the larger (generally) the inode 
 cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
 it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
 servicing a very light workload, and each syncfs(2) call was taking ~7 
 seconds (usually to write out a single inode).

 A possible workaround for such boxes is to turn 
 /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
 pages instead of inodes/dentries)...

 I think the take-away though is that we do need to bite the bullet and make 
 FileStore f[data]sync all the right things so that the syncfs call can be 
 avoided.  This is the path you were originally headed down, Somnath, and I 
 think it's the right one.

 The main thing to watch out for is that according to POSIX you really need to 
 fsync directories.  With XFS that isn't the case since all metadata 
 operations are going into the journal and that's fully ordered, but we don't 
 want to allow data loss on e.g. ext4 (we need to check what the metadata 
 ordering behavior is there) or other file systems.

I guess there are only a few directory-modifying operations, is that true?
Maybe we only need to do syncfs when modifying directories?


 :(

 sage

 




-- 
Best Regards,

Wheat