We had a hangout on Friday and made some progress. The problem can be
roughly decomposed into two types of checks: a forward, depth-first check
that scans the namespace and verifies everything is valid and reachable,
and a (slow) reverse iteration over all directory (or even file) objects
that verifies that everything is reachable via its backtrace.
Forward scan:
- depth-first search of the namespace.
- run by the mds
- mds can be online
- mds can efficiently verify the rstats as it goes (see the rstat
  sketch after this list)
- we can propagate any 'dirty' rstats up the tree as we find them.
- optionally verify the backtrace values on directories and/or files as
we go
- this can be efficient too: a single backtrace update op can be
  guarded by a check that the backtrace is valid/up to date (see the
  guarded-update sketch below).
- if we encounter a bad forward link (a primary dentry that references a
directory whose object does not exist), we add it to the set M
- we can store M as a key/value object somewhere.
- if we complete the scan and M is empty, and the rstats/dirstats were
  all clean/correct, we are happy.
- this can presumably be made a parallel operation without too much
trouble
- we could put a last_scrubbed_subtree stamp in the dirstat struct, but
it would mean dirtying directories as we go, which may not be what we
want in all cases. maybe make it optional.
- i don't think there is any reason why the fs couldn't be online and
  in use while this is going on. and that is what we want, anyway:
  online namespace scrubbing for healthy file systems.
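
For illustration, the per-directory rstat check might look roughly
like this (a Python sketch; list_children(), get_rstat(), and
mark_dirty_rstat() are hypothetical stand-ins for the real mds
internals, and whether a dir counts itself in rsubdirs is an
assumption here):

    from dataclasses import dataclass

    @dataclass
    class RStat:
        rfiles: int = 0
        rsubdirs: int = 0
        rbytes: int = 0

        def __add__(self, other):
            return RStat(self.rfiles + other.rfiles,
                         self.rsubdirs + other.rsubdirs,
                         self.rbytes + other.rbytes)

    def verify_rstat(dir, list_children, get_rstat, mark_dirty_rstat):
        # recompute the recursive stats from the children and compare
        # against the value stored in the directory inode
        expected = RStat(rsubdirs=1)    # assume the dir counts itself
        for child in list_children(dir):
            expected = expected + get_rstat(child)
        if expected != get_rstat(dir):
            mark_dirty_rstat(dir, expected)  # propagate fix upward
        return expected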
This would probably be the first piece we implement. There would be a
fast mode and a slow mode, depending on whether we verify
backtraces.
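
On the backtrace side, the guarded check/update from the list above
could look like this (a Python sketch; read_xattr()/write_xattr() are
stand-ins, and in rados the compare and the set could presumably be
bundled into a single guarded op so check+repair stays atomic):

    def scrub_backtrace(oid, expected, read_xattr, write_xattr):
        # cephfs keeps the backtrace in the object's 'parent' xattr;
        # only rewrite it if the stored value is stale
        current = read_xattr(oid, 'parent')
        if current == expected:
            return 'clean'
        write_xattr(oid, 'parent', expected)
        return 'repaired'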
Backward scan:
The forward scan doesn't help us find lost/orphan directories, and it
only detects the missing directories nearest the root; anything beneath
a missing directory is invisible to it. If M is non-empty, we need to
do the slow part. A backward scan would:
- iterate over directory objects in the metadata pool
- note that these are stored in hash(ino) order, so we get them in
random order.
- for each directory object, we look at the backtrace and traverse
  forward in the tree to verify it is reachable (see the sketch after
  this list).
- if it is not, we recreate the missing directory and add in the
  forward link from our backtrace.
- we can mark that link as 'tentative' or 'recovered' or something,
along with the (now) partially repopulated directory (which is in M)
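
A sketch of that reachability check (Python; the backtrace is
simplified to a list of (parent_ino, dname) pairs ordered from the
object up toward the root, and lookup() is a stand-in for a forward
dentry lookup):

    def find_broken_link(backtrace, lookup):
        # walk the backtrace from the root-most ancestor downward,
        # verifying each forward dentry; return the first missing
        # (dir_ino, dname) that needs recreating, or None if the
        # whole chain is intact
        want = None   # ino the previous dentry resolved to
        for parent_ino, dname in reversed(backtrace):
            if want is not None and parent_ino != want:
                return (want, dname)       # chain disagrees with itself
            want = lookup(parent_ino, dname)
            if want is None:
                return (parent_ino, dname)  # missing forward link
        return None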
Note that doing the above for *all* items would be sufficient, but
horribly slow... basically, O(n * log^2(n)) or something (assuming some
caching). But if M is smallish:
- only verify backtraces for directories that are descendants of an
  item in M (see the sketch after this list)
- and only verify the portion of the backtrace that is rooted above the
item from M; ignore the child portion.
- queue a forward scrub starting from that point to see if the
reattached subtree is complete.
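
The "descendant of something in M" test falls straight out of the
backtrace (same simplified backtrace representation as the sketch
above):

    def portion_to_verify(backtrace, M):
        # only objects with an ancestor in M need checking, and then
        # only the part of the backtrace rooted at or above that item
        for i, (parent_ino, dname) in enumerate(backtrace):
            if parent_ino in M:
                return backtrace[i:]    # ancestor-side portion
        return None                     # no ancestor in M: skip it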
Note that this is probably not so inefficient, even when that includes a
lot of directories, because the cache hit rate will be good. Say we
have/had /a/b/c/(1..10000000), and b was lost/in M. For the first c
child we hit, we would recreate an empty b and add the forward link
from b to c. Every subsequent child would hit the cached c (with a
single in-memory ino lookup) and/or verify the a/b/c forward
dentry/link and we'd move on.
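
A toy way to see why the cache saves us (Python; parent() and
link_ok() are hypothetical stand-ins for mds lookups, with lru_cache
playing the role of the mds cache):

    from functools import lru_cache

    @lru_cache(maxsize=None)       # the "mds cache" for this sketch
    def ancestry_ok(ino):
        p = parent(ino)
        if p is None:
            return True            # reached the root
        return link_ok(p, ino) and ancestry_ok(p)

    # for /a/b/c/(1..10000000): the first child verifies (and repairs)
    # the a/b/c chain once; every later child is a single cached hit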
The grandchild of an item in M could be missing too. So if b and c are
both lost, and we encounter /a/b/c/1, we add the c link and requeue the
forward scan, but immediately find that c is missing too and add it to M.
At that point, we would need to restart the scan. That is probably not
ideal. Instead, each item in M could have a range or interval_set over
which it has been scanned. The initial items would be [], but if we add
in c part-way through the pool, we could just continue our forward
scan and keep track of which parts of the namespace have been checked
for that ancestor (a toy version is sketched below).
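
A toy stand-in for that bookkeeping (ceph's interval_set would do this
for real; the intervals here are positions in the hash(ino) scan
order):

    def add_interval(ivs, lo, hi):
        # ivs: sorted list of disjoint [lo, hi) pairs; merge in the
        # newly scanned range
        ivs.append((lo, hi))
        ivs.sort()
        merged = [ivs[0]]
        for a, b in ivs[1:]:
            pa, pb = merged[-1]
            if a <= pb:
                merged[-1] = (pa, max(pb, b))
            else:
                merged.append((a, b))
        ivs[:] = merged

    def fully_scanned(ivs, lo, hi):
        # an item can be marked complete once its intervals cover the
        # whole scan range [lo, hi)
        return ivs == [(lo, hi)]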
After a full pass over the metadata pool (all dir objects) and/or data
pool(s) (file objects), items in M that were present for the full pass can
be marked 'complete' in the fs, and their ancestors adjusted with the new
summation/stat/last_clean values.
The scan could be optimized in a couple of ways (maybe):
- push the O(n) search to the OSD
- a class 'filter' thing would examine each object and determine if
  it is a match
  - it could parse the backtrace and pull out just the b and
    ancestors part, and uniq it over a bunch of objects.
- maintain an actual index on the OSD
- add some new rados functionality to maintain a real index over select
object xattrs, such as backtraces (or more likely individual
backtrace components), so that the search is O(log n) instead of
O(n). or something along those lines.
The trick in pushing filtering to the OSD is that it needs to know
what M is. We can do that easily for small M; otherwise we'd need to
use a bloom filter or something like the sketch below.
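
A minimal bloom filter along those lines (Python sketch; the sizing
and hash choice are placeholders):

    import hashlib

    class Bloom:
        # approximate membership for M; false positives just mean the
        # osd hands back a few extra objects for re-checking
        def __init__(self, nbits=1 << 20, nhash=4):
            self.nbits, self.nhash = nbits, nhash
            self.bits = bytearray(nbits // 8)

        def _positions(self, ino):
            for i in range(self.nhash):
                h = hashlib.sha1(('%d:%d' % (i, ino)).encode()).digest()
                yield int.from_bytes(h[:8], 'big') % self.nbits

        def add(self, ino):
            for p in self._positions(ino):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, ino):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(ino))

    # ship bloom.bits to the osd-side filter; 'ino in bloom' can say
    # yes for an ino that was never added, but never says no for a
    # member of M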
The implementation tricks will be:
- making the forward scan parallel/distributed. handing off subtrees to
scan when they are delegated across nodes is the main thing, i think.
if there are actual repairs, those would probably appear to be normal
updates/mutations. the existing cache coherency should make the
parent's summation/check on a dir after the children complete not much
different than a normal filelock scatter/gather type operation.
- making the forward scan recoverable if an mds restarts. periodically
  journal our current 'iterator' position(s) (see the checkpoint
  sketch after this list)? or, if we actually write out a last_clean
  stamp, we would just restart at the beginning and skip any
  recently-scrubbed subtrees.
- allowing forward and backward work to be going on at the same time. a
backward scan will periodically requeue subtrees for forward scanning.
and a forward scan may add to M.
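
For the restart bullet above, the checkpoint could be as simple as
periodically serializing the iterator state (a Python sketch; save()
and load() stand in for a journal entry or a rados object, and Q/M are
the work queue and missing set from the worker sketch below):

    import json

    def checkpoint(Q, M, save):
        # persist the scrub iterator state every so often
        save(json.dumps({'queue': list(Q), 'missing': sorted(M)}))

    def resume(load):
        # a restarted mds picks up roughly where it left off instead
        # of rescanning from the root (json turns the queue entries
        # into lists, so real code would re-encode them properly)
        state = json.loads(load())
        return state['queue'], set(state['missing'])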
On the last point above (forward and backward work overlapping), I
think deconstructing the depth-first search into discrete pre- and
post- steps (and not recursion) will simplify things, and make it more
easily parallel. For example, as a Python-ish sketch (exists() and
children() stand in for the real mds lookups):
    from collections import deque

    Q = deque()    # work queue of ('dir'|'post', item) entries
    M = set()      # missing directories found so far

    def scan(d):
        Q.appendleft(('dir', d))

    def worker():
        kind, x = Q.popleft()
        if kind == 'dir':
            if not exists(x):              # stand-in for an mds lookup
                M.add(x)                   # missing dir goes into M
                return
            Q.appendleft(('post', x))      # "done with x" marker
            for c in children(x):          # children go on the front,
                Q.appendleft(('dir', c))   # so they run before post(x)
        elif kind == 'post':
            post(x)

    def post(d):
        # enumerate children, recalculate the summation (rstat,
        # dirstat, whatever), and verify that the dir inode's stored
        # summation value is correct
        ...
...with something else in there to make us defer the post step for
any items in M or ancestors of items in M. And once repaired items are
removed from M, somehow requeue the post step so they and their
ancestors can validate the summations (one possible shape is sketched
below).
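
One possible shape for that deferral (Python sketch, reusing M and
post() from the worker sketch above; ancestors() is a hypothetical
walk up the in-memory tree):

    deferred = set()

    def maybe_post(d, ancestors, do_post):
        # skip the summation check while d or any ancestor of d is
        # still in M; park it instead
        if d in M or any(a in M for a in ancestors(d)):
            deferred.add(d)
        else:
            do_post(d)

    def on_repaired(ino, ancestors, do_post):
        # when a repair removes an item from M, retry any parked post
        # steps so the subtree and its ancestors can validate their
        # summations
        M.discard(ino)
        for d in list(deferred):
            if d not in M and not any(a in M for a in ancestors(d)):
                deferred.discard(d)
                do_post(d)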