On Mon, Feb 12 2018, Patrick Farrell wrote:

> Neil,
> I didn't get anything after 8/19 in this series.  Is this just me?  (I'd keep 
> waiting, except I also found a few things in this patch.)

Not just you.  My fault.  They are appearing now.

> Minor:
> The line XXX ALLOCATE is out of date and could go.  (It refers to a
> mix of things you eliminated and things that were already gone.)

What does the line even mean?  Some comment about stack usage?
I think we have a look that looks for large stack frames.  I wonder how
to run it...

> Less minor:
> You remove use of the imp_lock when reading the connection count.  While 
> that'll work on x86, it's probably wrong on some architecture to read that 
> without taking the lock...?

It was my understanding that on all architectures which Linux support, a
32bit aligned read is atomic wrt any 32bit write.  I have trouble imagining how
it could be otherwise.

I probably should have highlighted the removal of the spinlock in the
patch description though - it was intentional.

> Bug:
> The existing code uses the imp_conn_cnt from *before* the wait, rather
> than after.  I think that's quite important.  So you'll want to read
> it out before the wait.  I think the main reason we'd hit the timeout
> is a disconnect, which should cause a reconnect, so it's very
> important to use the value from *before* the wait.  (See comment on
> ptlrpc_set_import_discon for more of an explanation.  Basically it's
> tracking a connection 'epoch', if it's changed, someone else already
> went through the reconnect code for this 'connection epoch' and we
> shouldn't start that process.) 

That wasn't intentional though - thanks for catching!
Looking at  ptlrpc_set_import_discon(), which is where the number
eventually gets used, it is only used to compare with the new value of

This would fix both (assuming the locking issue needs fixing).


diff --git a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c 
index f1233d844bbd..c3c9186b74ce 100644
--- a/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
+++ b/drivers/staging/lustre/lustre/ldlm/ldlm_request.c
@@ -103,7 +103,7 @@ static int ldlm_request_bufsize(int count, int type)
        return sizeof(struct ldlm_request) + avail;
-static void ldlm_expired_completion_wait(struct ldlm_lock *lock, struct 
obd_import *imp2)
+static void ldlm_expired_completion_wait(struct ldlm_lock *lock, __u32 
        struct obd_import *imp;
        struct obd_device *obd;
@@ -129,7 +129,7 @@ static void ldlm_expired_completion_wait(struct ldlm_lock 
*lock, struct obd_impo
        obd = lock->l_conn_export->exp_obd;
        imp = obd->u.cli.cl_import;
-       ptlrpc_fail_import(imp, imp2 ? imp2->imp_conn_cnt : 0);
+       ptlrpc_fail_import(imp, conn_cnt);
                   "lock timed out (enqueued at %lld, %llds ago), entering 
recovery for %s@%s",
@@ -241,6 +241,7 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64 
flags, void *data)
        struct obd_device *obd;
        struct obd_import *imp = NULL;
        __u32 timeout;
+       __u32 conn_cnt = 0;
        int rc = 0;
        if (flags == LDLM_FL_WAIT_NOREPROC) {
@@ -268,6 +269,11 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64 
flags, void *data)
        lock->l_last_activity = ktime_get_real_seconds();
+       if (imp) {
+               spin_lock(&imp->imp_lock);
+               conn_cnt = imp->imp_conn_cnt;
+               spin_unlock(&imp->imp_lock);
+       }
                                 OBD_FAIL_LDLM_CP_BL_RACE | OBD_FAIL_ONCE)) {
@@ -280,7 +286,7 @@ int ldlm_completion_ast(struct ldlm_lock *lock, __u64 
flags, void *data)
                                                     timeout * HZ);
                        if (rc == 0)
-                               ldlm_expired_completion_wait(lock, imp);
+                               ldlm_expired_completion_wait(lock, conn_cnt);
                /* Now wait abortable */
                if (rc == 0)

Attachment: signature.asc
Description: PGP signature

Reply via email to