There's a fun potential problem with CEPH_MDS_OP_LOOKUPSNAP handling
in ceph_fill_trace().  Consider the following scenario:

Process calls stat(2).  Lookup locks parent, allocates dentry and calls
->lookup().  Request is created and sent over the wire.  Then we sit and
wait for completion.  Just as the reply has arrived, process gets SIGKILL.
OK, we get to
                /*
                 * ensure we aren't running concurrently with
                 * ceph_fill_trace or ceph_readdir_prepopulate, which
                 * rely on locks (dir mutex) held by our caller.
                 */
                mutex_lock(&req->r_fill_mutex);
                req->r_err = err;
                req->r_aborted = true;
                mutex_unlock(&req->r_fill_mutex);
and we got there before handle_reply() grabbed ->r_fill_mutex.  Then we return
to ceph_lookup(), drop the reference to request and bugger off.  Parent is
unlocked by caller.

In the meanwhile, there's another thread sitting in handle_reply().  It
got ->r_fill_mutex and called ceph_fill_trace().  Had that been something
like rename request, ceph_fill_trace() would've checked req->r_aborted and
that would've been it.  However, we hit this:
        } else if (req->r_op == CEPH_MDS_OP_LOOKUPSNAP ||
                   req->r_op == CEPH_MDS_OP_MKSNAP) {
                struct dentry *dn = req->r_dentry;
and proceed to
                dout(" linking snapped dir %p to dn %p\n", in, dn);
                dn = splice_dentry(dn, in, NULL, true);
which does
        realdn = d_materialise_unique(dn, in);
and we are in trouble - d_materialise_unique() assumes that ->i_mutex on parent
is held, which isn't guaranteed anymore.  Not that d_delete() done a couple
of lines earlier was any better...

I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't
get to its splice_dentry() and d_delete() calls in similar situations -
I hadn't checked that one yet.  If it isn't guaranteed, we have a problem
there as well.

I might very well be missing something - that code is seriously convoluted,
and race wouldn't be easy to hit, so I don't have anything resembling
a candidate reproducer ;-/  IOW, this is just from RTFS and I'd really
appreciate comments from folks familiar with ceph.

VFS side of requirements is fairly simple:
        * d_splice_alias(d, _), d_add_ci(d, _), d_add(d, _),
d_materialise_unique(d, _), d_delete(d), d_move(_, d) should be called
only with ->i_mutex held on the parent of d.
        * d_move(d, _), d_add_unique(d, _), d_instantiate_unique(d, _),
d_instantiate(d, _) should be called only with d being parentless (i.e.
d->d_parent == d, aka. IS_ROOT(d)) or with ->i_mutex held on the parent of d.
        * with the exception of "prepopulate dentry tree at ->get_sb() time"
kind of situations, d_alloc(d, _) and d_alloc_name(d, _) should be called
only with d->d_inode->i_mutex held (and it won't be too hard to get rid of
those exceptions, actually).
        * lookup_one_len(_, d, _) should only be called with ->i_mutex held
on d->d_inode
        * d_move(d1, d2) in case when d1 and d2 have different parents
should only be called with ->s_vfs_rename_mutex held on d1->d_sb (== d2->d_sb).

We are guaranteed that ->i_mutex is held on (inode of) parent of d in
        ->lookup(_, d, _)
        ->atomic_open(_, d, _, _, _, _)
        ->mkdir(_, d, _)
        ->symlink(_, d, _)
        ->create(_, d, _, _)
        ->mknod(_, d, _, _)
        ->link(_, _, d)
        ->unlink(_, d)
        ->rmdir(_, d)
        ->rename(_, d, _, _)
        ->rename(_, _, _, d)
Note that this is *not* guaranteed for another argument of ->link() - the
inode we are linking has ->i_mutex held, but nothing of that kind is promised
for its parent directory.
We also are guaranteed that ->i_mutex is held on the inode of opened
directory passed to ->readdir() and on victims of ->unlink(), ->rmdir() and
overwriting ->rename().

FWIW, I went through that stuff this weekend and we are fairly close to having
those requirements satisfied - I'll push a branch with the accumulated fixes
in a few and after that we should be down to very few remaining violations and
dubious places (ceph issues above being one of those).  And yes, this stuff
really need to be in Documentation/filesystems somewhere, along with the
full description of locking rules for ->d_parent and ->d_name accesses.  I'm
trying to put that together right now...
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to