There are several issues we need to address.
0- Online scrub. We should be able to do a (parallel) pass over the
directory tree that verifies that metadata is consistent: forward dentry
links agree with file and directory backtraces, and rstats and dirstats
are correct. This needs to work while the fs is active.
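To make the scrub check concrete, here is a minimal sketch of the backtrace half of it: walk the tree and flag any inode whose stored backtrace disagrees with the chain of forward links we followed to reach it. The dict-based tree and the 'backtrace'/'children' keys are purely illustrative, not actual MDS structures.

```python
def backtrace_of(path):
    """An inode's backtrace: the (parent, name) chain up to the root."""
    trace = []
    while path:
        parent, _, name = path.rpartition('/')
        trace.append((parent or '/', name))
        path = parent
    return trace

def scrub(tree, path='/'):
    """Yield paths whose recorded backtrace disagrees with the
    forward dentry links we walked to reach them."""
    for name, node in tree.get('children', {}).items():
        child_path = path.rstrip('/') + '/' + name
        if node.get('backtrace') != backtrace_of(child_path):
            yield child_path
        yield from scrub(node, child_path)
```

The same recursive walk could be parallelized per-subtree, and extended to accumulate and verify rstats on the way back up.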
1- Missing/corrupt/incomplete mds journal. If journal replay fails,
currently we just bomb out. Instead, we need to do one or more of:
- throw out journal(s) and current mds cluster members, and bring mds(s)
up.
- scour as much useful metadata out of the remaining bits of the journal
first.
- have a mode/flag (or not?) where we are in 'recovery' or 'unclean'
mode and will do some sort of online namespace repair.
2- Missing directory. This is the tricky one because a catastrophic rados
failure like a loss of a PG would mean we lose a random subset of the
directory objects. This would effectively prune off random subtrees of
the hierarchy, and we'd like to be able to find and reattach them. This
is what the backtrace stuff is there for.
3- Corrupt directory object, or corrupt inode metadata. If a directory
appears corrupt, we should salvage what we can and try to rebuild the
rest. This may just be similar to the above.
Our previous discussions have focused on how to handle #2, but I'm hoping
we can distill this down to a few common recovery behaviors that cover
entire classes of inconsistencies.
Handling a missing directory object is probably the hardest piece, so
let's start there. The basic idea we discussed before is to build a
working list of missing directories, say M. We then do a scan of the
objects in the metadata pool for objects that are children of M according
to their backtrace, and tentatively link them into place. If we encounter
some other forward link to one of those children with a newer version
stamp, the newer link wins and the tentative link is removed.
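A minimal sketch of that resolution rule, assuming each scanned object's backtrace carries a (parent ino, name, version) first hop; the field names and object layout here are assumptions for illustration, not actual MDS on-disk formats:

```python
def recover_links(missing, scanned_objects, forward_links):
    """Tentatively link scanned objects under missing parents; a forward
    link with a newer version stamp wins and removes the tentative link."""
    tentative = {}  # ino -> (version, parent_ino, name)
    for obj in scanned_objects:
        parent_ino, name, version = obj['backtrace'][0]
        if parent_ino in missing:
            tentative[obj['ino']] = (version, parent_ino, name)
    # Found during the full namespace scan: newer forward links supersede
    # the tentative placement, so the tentative link is dropped.
    for ino, (version, parent_ino, name) in forward_links.items():
        if ino in tentative and version > tentative[ino][0]:
            del tentative[ino]
    return tentative
```

The second loop is where the full-namespace scan matters: without seeing every surviving forward link, we can't be sure a tentative link isn't stale.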
Having complete confidence in the recovered links means we need to scan
the full namespace.
Searching for children is currently an O(n) operation for RADOS,
unless/until we introduce some indexing of objects. This may be feasible
with leveldb, but it's not there yet.
We need to also identify lost grandchildren. For /a/b/c, the a and b
directory objects may have been lost, but we only know that a is lost from
the broken /a link. If we only search for / children, the rebuilt /a/
won't include b (which is also lost). This means our tentative links may
need to include multiple ancestors, or be a multiple-pass type of
operation. The possibility of directory renames makes this especially
interesting. It may be that we shoot not so much for perfectly relinking
subtrees, but rather for exhaustively linking them, and just aggressively
push out backtrace updates after renames so that files will reappear
somewhere reasonably recent. (This is disaster recovery, after all.)
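One way to avoid multiple passes for the /a/b/c case is to use the full ancestor chain in each backtrace, so a lost intermediate directory (b) is recreated along with its children in a single scan. A sketch, under the same hypothetical object layout as before (backtraces as (parent ino, name) chains up to the root):

```python
def rebuild_from_backtraces(missing, scanned_objects):
    """Return ino -> (parent_ino, name) for every inode in a lost
    subtree, recreating lost intermediate directories from the
    ancestor chains in backtraces."""
    recovered = {}
    for obj in scanned_objects:
        child_ino = obj['ino']
        # Walk the chain upward until we hit a directory that survived.
        for parent_ino, name in obj['backtrace']:
            if child_ino not in recovered:
                recovered[child_ino] = (parent_ino, name)
            if parent_ino not in missing:
                break  # reached a surviving directory; subtree reattached
            child_ino = parent_ino
    return recovered
```

This is where stale backtraces after renames bite: the chain reflects wherever the last backtrace update landed, which is why aggressively pushing out backtrace updates after renames makes this "exhaustive relinking" land somewhere reasonably recent.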
At a high level, what we are trying to recover is a map of
ino -> (version, parent ino, name)
for all parent inos in the missing set.
I'm pretty sure we had significantly more insight into what was involved
here, but it was almost 2 years ago now since we discussed it and I'm
having trouble dredging it up. Hopefully this is enough to get people
started...
sage