I was able to reproduce the following problem quite frequently by
untarring the latest linux kernel source tree, running 'make
oldconfig' and then 'make -j2':
The make operation makes little progress and then hangs
after the first few steps. The problem is timing sensitive, and
I wasn't able to reproduce it on my uml. What happens is that the mds
and the client disagree on some directory's caps. This means that when
the mds sends a caps revocation request, the client ignores that
request, since it thinks that it has already revoked the specified
caps. Thus the mds waits indefinitely for the client's response.
It seems that the root cause of the client-mds disagreement was that
while waiting for the response to an mds readdir operation, the client got
a signal (probably from another process) that made it return
ERESTARTSYS (btw, we should translate it to EINTR) and drop the
actual mds response, which would have updated the caps. So, a trivial
solution would be:

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index abc9776..1429ed0 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1605,17 +1605,15 @@ int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
        if (!req->r_reply) {
                mutex_unlock(&mdsc->mutex);
                if (req->r_timeout) {
-                       err = (long)wait_for_completion_interruptible_timeout(
+                       err = (long)wait_for_completion_timeout(
                                &req->r_completion, req->r_timeout);
                        if (err == 0)
                                req->r_reply = ERR_PTR(-EIO);
                        else if (err < 0)
                                req->r_reply = ERR_PTR(err);
                } else {
-                        err = wait_for_completion_interruptible(
+                        wait_for_completion(
                                 &req->r_completion);
-                        if (err)
-                                req->r_reply = ERR_PTR(err);
                }
                mutex_lock(&mdsc->mutex);
        }

As we discussed recently, the problem with this is
that we won't be able to ^C while there are pending mds operations. We
also need to think of some other recovery mechanism for such
situations. E.g., instead of ignoring a caps revocation request that
(the client thinks) does nothing, the client should respond in
any case.
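
To make the "respond in any case" idea concrete, here is a toy userspace
model of the hang, not the actual fs/ceph code. All names in it (toy_mds,
toy_client, CAP_SHARED, client_handle_revoke, ...) are made up for
illustration, and the cap bit is not a real CEPH_CAP_* value. It only
assumes what is described above: the client dropped the reply that
granted a cap, so it believes it holds nothing, while the mds believes
the cap is issued.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cap bit, for illustration only (not a real CEPH_CAP_* value). */
enum { CAP_SHARED = 1 << 0 };

/* Toy view of one directory's caps on the mds side. */
struct toy_mds {
	unsigned issued;	/* caps the mds believes the client holds */
	unsigned revoking;	/* caps with a revocation still unacknowledged */
};

/* Toy view of the same directory's caps on the client side. */
struct toy_client {
	unsigned held;		/* caps the client believes it holds */
};

/* The mds starts revoking caps it thinks are issued; returns the pending bits. */
static unsigned mds_start_revoke(struct toy_mds *mds, unsigned caps)
{
	mds->revoking |= caps & mds->issued;
	return mds->revoking;
}

/* The mds receives the client's acknowledgement for the given caps. */
static void mds_handle_ack(struct toy_mds *mds, unsigned released)
{
	mds->issued &= ~released;
	mds->revoking &= ~released;
}

/*
 * The client handles a revocation request. With always_ack == false it stays
 * silent when it has nothing to drop (the behaviour described above); with
 * always_ack == true it acknowledges regardless. Returns true if an ack was sent.
 */
static bool client_handle_revoke(struct toy_client *client, struct toy_mds *mds,
				 unsigned caps, bool always_ack)
{
	unsigned dropped = client->held & caps;

	client->held &= ~caps;
	if (!dropped && !always_ack)
		return false;	/* no-op from the client's point of view: no reply */
	mds_handle_ack(mds, caps);
	return true;
}

/*
 * Scenario: the reply that granted CAP_SHARED was dropped by the client
 * (interrupted wait), then the mds revokes CAP_SHARED.
 * Returns true if the mds is left waiting for an ack.
 */
static bool mds_stuck_after_dropped_reply(bool always_ack)
{
	struct toy_mds mds = { .issued = CAP_SHARED, .revoking = 0 };
	struct toy_client client = { .held = 0 };	/* client never saw the grant */
	unsigned want = mds_start_revoke(&mds, CAP_SHARED);

	assert(want == CAP_SHARED);
	client_handle_revoke(&client, &mds, want, always_ack);
	return mds.revoking != 0;
}
```

In this model, mds_stuck_after_dropped_reply(false) leaves the
revocation pending forever, while mds_stuck_after_dropped_reply(true)
clears it. Whether an unconditional ack is safe in the real protocol
(e.g. with respect to sequence numbers) is a separate question that
this sketch does not address.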

Yehuda

_______________________________________________
Ceph-devel mailing list
Ceph-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ceph-devel
