Sage sent out an early draft of what we were thinking about doing for
fsck on CephFS at the beginning of the week, but it was a bit
incomplete and still very much a work in progress. I spent a good
chunk of today thinking about it more so that we can start planning
ticket-level chunks of work. The following is similar to where Sage's
email ended up, but incorporates a bit more thought about memory
scaling and is hopefully a bit more organized. :)

First, we are breaking up development and running of fsck into two
distinct phases. The first phase will consist of a "forward scrub",
which simply starts with the root directory inode and follows links
forward to check that it can find everything that's linked, and that
the forward- and backward-links are consistent. (Backward links are
under development right now; see http://tracker.ceph.com/issues/3540,
or the CephFS backlog at
http://tracker.ceph.com/rb/master_backlogs/cephfs, which is only
groomed for the first several items on the list but might be of
interest.) The intention is that this phase can be used both as part
of a requested full-system fsck and, separately, to do background
scrubbing during normal operation.
I've tried to think through this forward scrub phase enough to do real
development planning over the next couple of days, and have included
my description below. Please comment if you see issues or have
questions.

The second phase we're referring to as the "backward scan". This mode
is currently intended to be used as part of the fsck you would run
after somehow losing data in RADOS, and is exclusively an offline
operation (no client access to the data is permitted, etc.); it
involves scanning through every object in the CephFS metadata and data
storage pools. We haven't thought this one through in quite as much
detail, but I wanted to sketch out a mechanism (one that scales to
large directories and hierarchies) in enough detail to see how it
might impact the design of our forward scrub. I've got the details I
came up with below, but this is a much more complicated problem and
not one we need to start work on right away, so it doesn't go into
nearly as much depth.
Again though, please comment if you see any issues, have questions, or
think there's something in the backward scan that impacts the forward
scrub in a way I haven't accounted for!
Thanks,
Greg

========================================
MDS Forward Scrub
----------------------------------------------------------------------------
We maintain a stack of inodes to scrub. When a new scrub is requested,
the inode in question goes into this stack at a position depending on
how it's inserted.

We have a separate scrubbing thread in every MDS. This thread begins
in the scrub_node(inode) function, passing in the inode on the top of
the scrub stack.
scrub_node() starts by setting a new scrub_start_stamp and
scrub_start_version on the inode (where the scrub_start_version is the
version of the *parent* of the inode).

If the node is a file: the thread optionally spins off an async check
of the backtrace (and, in the future, optionally checks other metadata
we might be able to add or pick up), then sleeps until
finish_scrub(inode) is called. (If it doesn't do the backtrace check,
it calls finish_scrub() directly.)

If the node is a dirfrag: it puts the dirfrag's first child on top of
the stack and calls scrub_node(child). Note that this might involve
reading the dirfrag off disk, etc.
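
To make the flow concrete, here is a minimal sketch of the scrub stack
and scrub_node() as described above. All of the types, fields, and
helpers are hypothetical placeholders made up for illustration, not
the real MDS classes:

  // Hypothetical sketch of the scrub stack and the scrub_node() flow.
  #include <cstdint>
  #include <ctime>
  #include <string>
  #include <vector>

  struct Inode {
    std::string name;
    bool is_dirfrag = false;           // directory fragment vs. plain file
    uint64_t version = 0;              // this inode's/dirfrag's version
    uint64_t scrub_start_version = 0;  // parent's version when scrub began
    std::time_t scrub_start_stamp = 0;
    uint64_t last_scrubbed_version = 0;
    std::time_t last_scrubbed_stamp = 0;
    Inode* parent = nullptr;
    std::vector<Inode*> children;      // loaded dirfrag contents, if any
  };

  struct ScrubStack {
    std::vector<Inode*> stack;         // top of stack = back of the vector
    void push(Inode* in) { stack.push_back(in); }
    Inode* top() { return stack.empty() ? nullptr : stack.back(); }
    void pop() { if (!stack.empty()) stack.pop_back(); }
  };

  void finish_scrub(ScrubStack& s, Inode* in);  // sketched further below

  // Scrub the inode currently on top of the stack.
  void scrub_node(ScrubStack& s, Inode* in) {
    in->scrub_start_stamp = std::time(nullptr);
    in->scrub_start_version = in->parent ? in->parent->version : 0;

    if (!in->is_dirfrag) {
      // File: optionally kick off an async backtrace check; in this
      // sketch we just complete immediately.
      finish_scrub(s, in);
      return;
    }

    // Dirfrag: descend into the first child (which might require
    // fetching the dirfrag off disk in the real implementation).
    if (!in->children.empty()) {
      Inode* child = in->children.front();
      s.push(child);
      scrub_node(s, child);
    } else {
      finish_scrub(s, in);  // empty dirfrag; nothing below to check
    }
  }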

finish_scrub(inode) is pretty simple. If the inode is a dirfrag, it
first verifies that the dirfrag's data matches the aggregate data of
its children, then does the same things as for a file:
1) It sets last_scrubbed_stamp to scrub_start_stamp, and
last_scrubbed_version to scrub_start_version.
2) It pops the inode off of the scrub stack, and checks whether the
next thing up is the inode's parent.
3) If so, it calls scrub_node() on the dentry following this one in
the parent dirfrag.
3b) If there are no remaining dentries in the parent dirfrag, it
checks that all the children were scrubbed following the parent's
scrub_start_version (or modified; we don't want to scrub hierarchies
that were renamed into the tree following a scrub start), then calls
finish_scrub() on the dirfrag.
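
Continuing the same hypothetical sketch (the Inode and ScrubStack
types, and scrub_node(), are from the snippet above), finish_scrub()
might look roughly like this:

  #include <algorithm>  // std::find

  void finish_scrub(ScrubStack& s, Inode* in) {
    if (in->is_dirfrag) {
      // Verify the dirfrag's data matches the aggregate data of its
      // children (rstat/fragstat-style accounting); omitted in this sketch.
    }

    // 1) Record when, and against which parent version, this scrub ran.
    in->last_scrubbed_stamp = in->scrub_start_stamp;
    in->last_scrubbed_version = in->scrub_start_version;

    // 2) Pop ourselves off the scrub stack; is our parent next?
    s.pop();
    Inode* parent = in->parent;
    if (parent == nullptr || s.top() != parent)
      return;  // an injected scrub (or nothing) is next; handled elsewhere

    // 3) If so, continue with the dentry following this one in the
    //    parent dirfrag.
    auto it = std::find(parent->children.begin(), parent->children.end(), in);
    if (it != parent->children.end() && ++it != parent->children.end()) {
      Inode* next = *it;
      s.push(next);
      scrub_node(s, next);
      return;
    }

    // 3b) No dentries left: check that every child was scrubbed after
    //     the parent's scrub_start_version, or modified (e.g. renamed
    //     in) since then, before finishing the parent itself.
    for (Inode* child : parent->children) {
      if (child->last_scrubbed_version < parent->scrub_start_version &&
          child->version < parent->scrub_start_version)
        return;  // something was missed; the real code would flag this
    }
    finish_scrub(s, parent);
  }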

If at any point the scrub thread finishes scrubbing a node without
another one starting up immediately (implying that another scrub got
injected into the middle of one that was already running), it looks at
the node in question. If it's a file, it calls scrub_node() on it. If
it's a dirfrag, it finds the first dentry in the dirfrag with a
last_scrubbed_version less than the dirfrag's last_scrubbed_version,
puts that dentry on the scrub stack, and calls scrub_node() on that
dentry.
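
Reading "the node in question" as whatever is now on top of the stack,
the resume path might look like this in the same hypothetical sketch
(resume_top() is a made-up name):

  // Resume path: called when finish_scrub() returned without chaining
  // into a sibling; restarts scrubbing from the top of the stack.
  void resume_top(ScrubStack& s) {
    Inode* top = s.top();
    if (top == nullptr)
      return;  // stack drained; this scrub pass is done

    if (!top->is_dirfrag) {
      scrub_node(s, top);  // a plain file: just scrub it
      return;
    }

    // A dirfrag: find the first dentry that hasn't been scrubbed as
    // recently as the dirfrag itself, push it, and continue from there.
    for (Inode* child : top->children) {
      if (child->last_scrubbed_version < top->last_scrubbed_version) {
        s.push(child);
        scrub_node(s, child);
        return;
      }
    }
    finish_scrub(s, top);  // every dentry is already up to date
  }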

This is simple enough in concept (although functionally it will need
to be broken up quite a bit more in order to do all the locking in a
reasonably efficient fashion). To expand this to a multi-MDS system,
modify it slightly according to the following rules:
1) Only the authoritative MDS for an inode can scrub that inode.
2) If you are scrubbing a tree and reach an inode for which you are
not authoritative, you pass that scrub off to the authoritative MDS
until you get a result back, and meanwhile place the next inode in the
tree on top of the stack and start scrubbing it.
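
A rough sketch of rule 2, again with made-up helpers: authority_of()
and send_scrub_to() stand in for whatever subtree-authority lookup and
inter-MDS messaging we end up using, stubbed here so the snippet
compiles against the earlier sketch.

  int  authority_of(Inode*)       { return 0; }  // stub: pretend rank 0 owns all
  void send_scrub_to(int, Inode*) {}             // stub: would message that rank

  // Returns true if the inode was handed off to another MDS, in which
  // case the caller just continues with the next entry on its own stack.
  bool maybe_delegate(ScrubStack& s, Inode* in, int my_rank) {
    int auth = authority_of(in);
    if (auth == my_rank)
      return false;            // we are authoritative: scrub it locally
    send_scrub_to(auth, in);   // the remote MDS scrubs it and reports back
    s.pop();                   // don't descend into that subtree here
    return true;
  }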

But of course you'll note this doesn't include what to do if the
scrubbing turns up an issue. In the initial forward scrub
implementation, this is lame: add the bad object to a designated
key-value object in the RADOS metadata pool, and set an "inconsistent"
flag on it that is propagated up through its ancestors (via a separate
"inconsistent descendant" flag) and triggers admin notifications.

========================================
MDS Backward Scan
----------------------------------------------------------------------------

A backward scan fsck will only be started at admin request, or if a
forward scrub detects inconsistencies. It disables client writes on
the cluster.

Very broadly:
One MDS is the scrub leader, responsible for maintaining the scrub
list. It might initially contain the list of problem inodes found in a
forward scrub, but it is in general populated by iterating through all
the objects in the metadata (and then data) pools.
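
For a sense of what that population step looks like at the RADOS
level, here is a rough librados sketch that just lists every object in
the metadata pool; the pool name and the minimal error handling are
simplifying assumptions, and this is only illustrative, not the actual
fsck code.

  #include <iostream>
  #include <rados/librados.hpp>

  int main() {
    librados::Rados cluster;
    cluster.init(nullptr);            // default client identity
    cluster.conf_read_file(nullptr);  // default ceph.conf search path
    if (cluster.connect() < 0)
      return 1;

    librados::IoCtx ioctx;
    if (cluster.ioctx_create("metadata", ioctx) < 0)  // assumed pool name
      return 1;

    // Every head object found here would be queued on the scrub list.
    for (librados::NObjectIterator it = ioctx.nobjects_begin();
         it != ioctx.nobjects_end(); ++it)
      std::cout << it->get_oid() << "\n";

    cluster.shutdown();
    return 0;
  }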

For each directory or file head object, if it is not marked as already
scrubbed into place, the scrub leader attempts to find that item
within the already-known tree, using the (coming very shortly!)
lookup-by-ino functionality. If it can't place the inode, it chooses
to temporarily believe the backtrace on the inode and creates the
necessary directories and links, marking them as tentative and
including the version of the backtrace they came from.

It then starts a forward scrub on the dirfrag closest to the root that
it was able to retrieve off disk (that might be nothing, if it can't
find any). (This forward scrub will also be marked as based on a
tentative backtrace, with the version it came from.) Any
inconsistencies the forward scrub finds are marked and written to
reference objects for later review. (This would include things like
"I'm sure the backtrace this inode has pointing at me is wrong,
because I have a higher version and lack a dentry for it".)

Similarly, if the forward scrub finds objects on disk with outdated
data, it updates their data and marks the reference objects to note
that the object was fixed (and the version it was fixed up to). If it
finds newer data on disk, it incorporates that into the current tree
(with the tentative markings and the associated versions). If the
newer data points to a dirfrag that isn't yet in the tree, it inserts
a fake entry and puts it at the bottom of the scrub queue. It then
continues the forward scrub from the node it was on.
If, in either a forward or reverse scrub, we find an on-disk version
which places authority for a subtree we're accessing with another MDS,
we stop any ongoing activity and ship it to the authoritative node. If
we discover that we should have authority over a node that somebody
else is currently holding, we send them a message and they stop
working on it and ship it over to us.

An object that does not contain a backpointer and has no forward
referents gets placed into a lost+found directory. :(

Once we've completely traversed the CephFS pools, we take the existing
tentative metadata as correct, toss out the pre-fsck versions, and
clean up.
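
To summarize the per-object placement step above in code form (every
helper here is a hypothetical placeholder for functionality described
in the text, stubbed out so the sketch compiles):

  #include <cstdint>
  #include <string>
  #include <vector>

  struct Backtrace {
    uint64_t version = 0;
    std::vector<std::string> ancestors;  // path components, object upward
  };

  // Stubs standing in for functionality described in the text.
  bool lookup_by_ino(uint64_t)                 { return false; }
  Backtrace read_backtrace(uint64_t)           { return {}; }
  void create_tentative_path(const Backtrace&) {}  // fabricate dirs/links,
                                                   // marked tentative + version
  void forward_scrub_from_known_root(uint64_t, const Backtrace&) {}

  void place_object(uint64_t ino, bool already_scrubbed_into_place) {
    if (already_scrubbed_into_place)
      return;                 // nothing to do for this object

    if (lookup_by_ino(ino))
      return;                 // the tree already knows where this lives

    // Tentatively believe the object's own backtrace, build the missing
    // directories and links (marked tentative and tagged with the
    // backtrace version), then forward-scrub starting from the dirfrag
    // closest to the root that we could actually read off disk.
    Backtrace bt = read_backtrace(ino);
    create_tentative_path(bt);
    forward_scrub_from_known_root(ino, bt);
  }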

This obviously elides a lot of important details, but I think it
describes an object-listing-based fsck that we can use to recover all
the data the cluster has into the filesystem hierarchy in a way that
scales. I believe the most difficult part not described here will be a
mechanism that lets us maintain both the original, unchanged data and
the in-progress fsck versions of the inodes, in a way that preserves
our standard hierarchy migration mechanisms, journaling (or perhaps
not, in this mode), and directory object management tools. Assuming we
can do that (I think we can!), this won't be fast, but it will be
robust and hopefully not many times slower than an optimal algorithm
would be.