[adding linux-btrfs to cc]

Josef, Chris, any ideas on the below issues?

On Mon, 24 Oct 2011, Christian Brunner wrote:
> Thanks for explaining this. I don't have any objections against btrfs
> as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> scare me, since I can use the ceph replication to recover a lost
> btrfs-filesystem. The only problem I have is that btrfs is not stable
> on our side and I wonder what you are doing to make it work. (Maybe
> it's related to the load pattern of using ceph as a backend store for
> qemu).
> 
> Here is a list of the btrfs problems I'm having:
> 
> - When I run ceph with the default configuration (btrfs snaps enabled)
> I can see a rapid increase in Disk-I/O after a few hours of uptime.
> Btrfs-cleaner is using more and more time in
> btrfs_clean_old_snapshots().

In theory, taking a snapshot and removing it a few commits later shouldn't 
cost significantly more than the prior root refs that btrfs already holds 
on to internally until the new commit is complete.  That's clearly not 
quite the case, though.

In any case, we're going to try to reproduce this issue in our 
environment.

> - When I run ceph with btrfs snaps disabled, the situation is getting
> slightly better. I can run an OSD for about 3 days without problems,
> but then again the load increases. This time, I can see that the
> ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> than usual.

FYI in this scenario you're exposed to the same journal replay issues that 
ext4 and XFS are.  The btrfs workload that ceph is generating will also 
not be all that special, though, so this problem shouldn't be unique to 
ceph.

> Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> from time to time. Maybe it's related to the performance issues, but
> I haven't been able to verify this.

I haven't seen this yet with the latest stuff from Josef, but others have.  
Josef, is there any information we can provide to help track it down?

> It's really sad to see that ceph performance and stability are
> suffering that much from the underlying filesystems and that this
> hasn't changed over the last months.

We don't have anyone internally working on btrfs at the moment, and are 
still struggling to hire experienced kernel/fs people.  Josef has been 
very helpful with tracking these issues down, but he has responsibilities 
beyond just the Ceph related issues.  Progress is slow, but we are 
working on it!

sage


> 
> Kind regards,
> Christian
> 
> 2011/10/24 Sage Weil <s...@newdream.net>:
> > Although running on ext4, xfs, or whatever other non-btrfs you want mostly
> > works, there are a few important remaining issues:
> >
> > 1- ext4 limits total xattrs to 4KB.  This can cause problems in some
> > cases, as Ceph uses xattrs extensively.  Most of the time we don't hit
> > this.  We do hit the limit with radosgw pretty easily, though, and may
> > also hit it in exceptional cases where the OSD cluster is very unhealthy.
> >
> > There is a large xattr patch for ext4 from the Lustre folks that has been
> > floating around for (I think) years.  Maybe as interest grows in running
> > Ceph on ext4 this can move upstream.
> >
> > Previously we were being forgiving about large setxattr failures on ext3,
> > but we found that was leading to corruption in certain cases (because we
> > couldn't set our internal metadata), so the next release will assert/crash
> > in that case (fail-stop instead of fail-maybe-eventually-corrupt).
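
To be concrete, the resulting behaviour at the syscall level is roughly the
following (a minimal sketch, not actual Ceph code; the helper name is made
up):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/xattr.h>

    /* Hypothetical helper: store a piece of internal metadata in an xattr.
     * On ext4 the per-inode xattr space is limited (~4KB total), so a
     * large value can fail with ENOSPC/E2BIG; rather than carrying on
     * with incomplete metadata, the daemon stops. */
    static void set_required_xattr(const char *path, const char *name,
                                   const void *val, size_t len)
    {
        if (setxattr(path, name, val, len, 0) < 0) {
            fprintf(stderr, "setxattr %s on %s failed: %s\n",
                    name, path, strerror(errno));
            abort();  /* fail-stop instead of fail-maybe-eventually-corrupt */
        }
    }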
> >
> > XFS does not have an xattr size limit and thus does not have this problem.
> >
> > 2- The other problem is with OSD journal replay of non-idempotent
> > transactions.  On non-btrfs backends, the Ceph OSDs use a write-ahead
> > journal.  After restart, the OSD does not know exactly which transactions
> > in the journal may have already been committed to disk, and may reapply a
> > transaction again during replay.  For most operations (write, delete,
> > truncate) this is fine.
> >
> > Some operations, though, are non-idempotent.  The simplest example is
> > CLONE, which copies (efficiently, on btrfs) data from one object to
> > another.  If the source object is modified, the osd restarts, and then
> > the clone is replayed, the target will get incorrect (newer) data.  For
> > example,
> >
> > 1- clone A -> B
> > 2- modify A
> >   <osd crash, replay from 1>
> >
> > B will get new instead of old contents.
> >
> > (This doesn't happen on btrfs because the snapshots allow us to replay
> > from a known consistent point in time.)
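
To make the replay hazard concrete, here is a toy model (ordinary files
stand in for objects and a plain copy stands in for CLONE; none of this is
actual OSD code):

    #include <stdio.h>

    /* Write 'data' as the complete contents of object 'obj'. */
    static void put(const char *obj, const char *data)
    {
        FILE *f = fopen(obj, "w");
        if (f) { fputs(data, f); fclose(f); }
    }

    /* CLONE stand-in: copy the contents of 'src' over 'dst'. */
    static void clone_obj(const char *src, const char *dst)
    {
        char buf[64] = "";
        FILE *f = fopen(src, "r");
        if (f) { if (!fgets(buf, sizeof buf, f)) buf[0] = '\0'; fclose(f); }
        put(dst, buf);
    }

    int main(void)
    {
        put("A", "old");
        clone_obj("A", "B");   /* journal entry 1: B should hold "old" */
        put("A", "new");       /* journal entry 2 */

        /* <osd crash; journal replayed from entry 1> */
        clone_obj("A", "B");   /* replayed clone: B now holds "new" */
        put("A", "new");       /* replaying the write is harmless */
        return 0;
    }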
> >
> > For things like clone, skipping the operation if the target exists almost
> > works, except for cases like
> >
> > 1- clone A -> B
> > 2- modify A
> > ...
> > 3- delete B
> >   <osd crash, replay from 1>
> >
> > (Although in that example who cares if B had bad data; it was removed
> > anyway.)  The larger problem, though, is that this check doesn't always work;
> > CLONERANGE copies a range of a file from A to B, where B may already
> > exist.
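
For whole-object CLONE the replay-time check would just be an existence
test; the sketch below (invented names, not real code) also notes why the
same guard can't be reused for CLONERANGE:

    #include <stdbool.h>
    #include <unistd.h>

    /* Hypothetical replay guard: skip replaying a clone if the target
     * object already exists on disk. */
    static bool skip_clone_replay(const char *target)
    {
        return access(target, F_OK) == 0;
    }

    /* For a CLONERANGE-style op this guard is useless: the target is
     * allowed to exist before the op runs, so its presence says nothing
     * about whether this particular range copy was already applied. */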
> >
> > In practice, the higher level interfaces don't make full use of the
> > low-level interface, so it's possible some solution exists that carefully
> > avoids the problem with a partial solution in the lower layer.  This makes
> > me nervous, though, as it is easy to break.
> >
> > Another possibility:
> >
> >  - on non-btrfs, we set a xattr on every modified object with the
> >   op_seq, the unique sequence number for the transaction.
> >  - for any (potentially) non-idempotent operation, we fsync() before
> >   continuing to the next transaction, to ensure that xattr hits disk.
> >  - on replay, we skip a transaction if the xattr indicates we already
> >   performed this transaction.
> >
> > Because every 'transaction' only modifies a single object (file),
> > this ought to work.  It'll make things like clone slow, but let's face it:
> > they're already slow on non-btrfs file systems because they actually copy
> > the data (instead of duplicating the extent refs in btrfs).  And it should
> > make the full ObjectStore interface safe, without upper layers having to
> > worry about the kinds and orders of transactions they perform.
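
Roughly, the guard could look like the sketch below (the attribute name
"user.osd.op_seq" and the helpers are invented for illustration, not the
actual ObjectStore code):

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/xattr.h>
    #include <unistd.h>

    #define SEQ_XATTR "user.osd.op_seq"   /* hypothetical attribute name */

    /* Record the sequence number of the transaction that last touched
     * this object. */
    static void record_op_seq(int fd, uint64_t seq)
    {
        if (fsetxattr(fd, SEQ_XATTR, &seq, sizeof(seq), 0) < 0)
            abort();           /* fail-stop, as with other xattr failures */
    }

    /* On replay: has this object already seen transaction 'seq'? */
    static int already_applied(int fd, uint64_t seq)
    {
        uint64_t have = 0;
        if (fgetxattr(fd, SEQ_XATTR, &have, sizeof(have)) < 0)
            return 0;          /* no marker -> not yet applied */
        return have >= seq;
    }

    /* After a (potentially) non-idempotent op such as clone, make sure
     * the data and the marker reach disk before the journal moves on. */
    static void commit_non_idempotent(int fd, uint64_t seq)
    {
        record_op_seq(fd, seq);
        if (fsync(fd) < 0)
            abort();
    }

Reading the xattr back at replay time is cheap; the real cost is the extra
fsync(), and that only has to be paid for the rare non-idempotent ops.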
> >
> > Other ideas?
> >
> > This issue is tracked at http://tracker.newdream.net/issues/213.
> >
> > sage
> >
> >