Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=9829



(From update of attachment 9435)
> #define cfs_schedule_timeout(s, t)              \
>-        do {                                    \
>-                cfs_waitlink_t    l;            \
>-                cfs_waitq_timedwait(&l, s, t);  \
>-        } while (0)
>+({                                              \
>+        cfs_duration_t _ret;                    \
>+        cfs_waitlink_t    l;                    \
>+        _ret = cfs_waitq_timedwait(&l, s, t);   \
>+        _ret;                                   \
>+})

Strange, I haven't noticed such problems myself, though I may have missed them.
What other code uses cfs_schedule_timeout(), and should those callers also loop
until the full timeout has elapsed, as OBD_FAIL_TIMEOUT does with this patch?
I think this change needs inspection from others with more knowledge of this
area, maybe Nikita and/or Oleg.  What kernel were you testing with?

> #define OBD_FAIL_TIMEOUT(id, secs)                                           \
> do {                                                                         \
>         if (OBD_FAIL_CHECK_ONCE(id)) {                                       \
>+                cfs_duration_t timeout = cfs_time_seconds(secs);             \
>                 CERROR("obd_fail_timeout id %x sleeping for %d secs\n",      \
>                        (id), (secs));                                        \
>+                do {                                                         \
>+                        set_current_state(TASK_UNINTERRUPTIBLE);             \
>+                        timeout = cfs_schedule_timeout(CFS_TASK_UNINT,       \
>+                                            timeout);                        \
>+                        CERROR("cfs_schedule_timeout return %ld\n", timeout);\
>+                } while (timeout > 0);                                       \

Could all of this be done inside the cfs_schedule_timeout() macro itself?
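
To illustrate what I mean, here is a rough user-space sketch of a macro that
keeps waiting until no time remains.  It is not the real libcfs code: the
typedefs are simplified assumptions, stub_timedwait() is a hypothetical
stand-in for cfs_waitq_timedwait() that pretends to wake early, and
sketch_schedule_timeout() is an illustrative name.  In the kernel the task
state would presumably also need to be set again before each wait, as the
patch does with set_current_state().

```c
/* User-space sketch only; needs GNU C statement expressions (gcc/clang).
 * The types and stub_timedwait() are assumptions, not libcfs definitions. */
typedef long cfs_duration_t;              /* assumed: time left, in ticks */
typedef struct { int unused; } cfs_waitlink_t;

static int stub_calls;                    /* number of simulated waits */

/* Hypothetical stand-in for cfs_waitq_timedwait(): pretends the sleeper
 * was woken after at most 3 ticks and returns the time still remaining. */
static cfs_duration_t stub_timedwait(cfs_waitlink_t *l, int state,
                                     cfs_duration_t t)
{
        (void)l;
        (void)state;
        stub_calls++;
        return t > 3 ? t - 3 : 0;
}

/* Re-arm the wait until nothing remains, so callers need no loop. */
#define sketch_schedule_timeout(s, t)                           \
({                                                              \
        cfs_duration_t _left = (t);                             \
        cfs_waitlink_t _l;                                      \
        do {                                                    \
                _left = stub_timedwait(&_l, (s), _left);        \
        } while (_left > 0);                                    \
        _left;                                                  \
})
```

With something like this, OBD_FAIL_TIMEOUT would shrink back to a single
call, and any other caller that gets woken early would be covered as well.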

>@@ -638,6 +638,7 @@ static int after_reply(struct ptlrpc_req
>         lustre_msg_set_transno(req->rq_reqmsg, req->rq_transno);
> 
>         if (req->rq_import->imp_replayable) {
>+                //OBD_FAIL_TIMEOUT(OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY, obd_timeout);

We may as well make this a separate failure location and split test_8 into two
versions (8a, 8b).  I think you should also make the timeout somewhat larger,
e.g. obd_timeout * 2, so that we definitely get into recovery and the OST
finishes recovery before this times out.

>+test_8() {
>+    ost_facet=${ost1_svc}
>+    do_facet ost1 $LCTL --device %$ost_facet readonly
>+    # don't set notransno - we want transactions to commit that are "lost"
>+    dd if=/dev/zero of=$DIR/$tfile bs=4k count=1 || error "dd $tfile failed"
>+    # might need an OBD_FAIL_TIMEOUT in after_reply() so the request is still
>+    # waiting on replay list when transno goes back in time and recovery starts
>+#define OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY      0x507
>+    do_facet ost1 "sysctl -w lustre.fail_loc=0x80000507"
>+
>+    sync; sleep 2; sync
>+    fail ost1
>+
>+    dmesg | grep "went back in time" || error "didn't go back in time"
>+    # would LBUG/hang here without this fix
>+    do_facet ost1 "sysctl -w lustre.fail_loc=0"
>+}
>+run_test 8 "Fail OST testing transno goes back"

I can't think of any other way to hit this failure, and this resembles the
customer failure case as best as I can tell.

There were also reports that the "fix" patch in attachment 9086 caused
acceptance-small.sh to crash.  Did you have any similar problems when running
acceptance-small.sh with the fix in place?
