Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=9829
(From update of attachment 9435)
> #define cfs_schedule_timeout(s, t) \
>- do { \
>- cfs_waitlink_t l; \
>- cfs_waitq_timedwait(&l, s, t); \
>- } while (0)
>+({ \
>+ cfs_duration_t _ret; \
>+ cfs_waitlink_t l; \
>+ _ret = cfs_waitq_timedwait(&l, s, t); \
>+ _ret; \
>+})
Strange, I haven't noticed such problems, but maybe I missed them. What other
code uses cfs_schedule_timeout(), and should it be changed to behave the same
way as OBD_FAIL_TIMEOUT? I think this change needs inspection from others with
more knowledge of this area, maybe Nikita and/or Oleg. Which kernel were you
testing with?
> #define OBD_FAIL_TIMEOUT(id, secs) \
> do { \
> if (OBD_FAIL_CHECK_ONCE(id)) { \
>+ cfs_duration_t timeout = cfs_time_seconds(secs); \
> CERROR("obd_fail_timeout id %x sleeping for %d secs\n", \
> (id), (secs)); \
>+ do { \
>+ set_current_state(TASK_UNINTERRUPTIBLE); \
>+ timeout = cfs_schedule_timeout(CFS_TASK_UNINT, \
>+ timeout); \
>+ CERROR("cfs_schedule_timeout return %ld\n", timeout);\
>+ } while (timeout > 0); \
Could this loop be done inside the cfs_schedule_timeout() macro instead?
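
One way to read the suggestion above is a single wrapper that keeps re-arming
the sleep until no time remains, so callers do not need their own loop. The
following is a minimal userspace sketch in C, not the libcfs implementation:
sleep_fn_t, sleep_full() and fake_sleep() are hypothetical names, and the fake
primitive only imitates a sleep that can return early with time left over.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sleep primitive that may wake early.
 * It returns the time still remaining, or 0 once the timeout elapsed. */
typedef long (*sleep_fn_t)(long timeout);

/* Sleep until the whole timeout has been consumed, re-arming after
 * every early wakeup. Returns how many times the primitive was called. */
static int sleep_full(sleep_fn_t sleep_once, long timeout)
{
    int calls = 0;

    assert(sleep_once != NULL);
    while (timeout > 0) {
        timeout = sleep_once(timeout);
        calls++;
    }
    return calls;
}

/* Fake primitive for illustration only: consumes at most 3 ticks per
 * call, then reports whatever is left. */
static long fake_sleep(long timeout)
{
    return timeout > 3 ? timeout - 3 : 0;
}
```

With the fake primitive, sleep_full(fake_sleep, 10) calls it four times
(10, 7, 4, 1 ticks remaining on entry) before the remaining time reaches zero.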
>@@ -638,6 +638,7 @@ static int after_reply(struct ptlrpc_req
> lustre_msg_set_transno(req->rq_reqmsg, req->rq_transno);
>
> if (req->rq_import->imp_replayable) {
>+ //OBD_FAIL_TIMEOUT(OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY, obd_timeout);
We may as well make this a separate failure location and make two versions
of test_8 (8a, 8b). I think you should also increase the timeout, e.g. to
obd_timeout * 2, so that we definitely get into recovery and the OST finishes
recovery before this times out.
>+test_8() {
>+ ost_facet=${ost1_svc}
>+ do_facet ost1 $LCTL --device %$ost_facet readonly
>+ # don't set notransno - we want transactions to commit that are "lost"
>+ dd if=/dev/zero of=$DIR/$tfile bs=4k count=1 || error "dd $tfile failed"
>+ # might need an OBD_FAIL_TIMEOUT in after_reply() so the request is still
>+ # waiting on replay list when transno goes back in time and recovery starts
>+#define OBD_FAIL_PTLRPC_DELAY_AFTER_REPLY 0x507
>+ do_facet ost1 "sysctl -w lustre.fail_loc=0x80000507"
>+
>+ sync; sleep 2; sync
>+ fail ost1
>+
>+ dmesg | grep "went back in time" || error "didn't go back in time"
>+ # would LBUG/hang here without this fix
>+ do_facet ost1 "sysctl -w lustre.fail_loc=0"
>+}
>+run_test 8 "Fail OST testing transno goes back"
I can't think of any other way to hit this failure, and this resembles the
customer failure case as best as I can tell.
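
For readers comparing the #define in the test with the value written to
lustre.fail_loc, the arithmetic can be sketched as below. This assumes the
high bit is what marks a location as "fail once", which is how the 0x507 and
0x80000507 values in the patch appear to relate; FAIL_ONCE_FLAG,
fail_loc_once() and fail_loc_id() are hypothetical names, not Lustre
identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the top bit marks a fail-once location (hypothetical name). */
#define FAIL_ONCE_FLAG 0x80000000u

/* Combine a failure id with the assumed fail-once flag. */
static uint32_t fail_loc_once(uint32_t id)
{
    assert((id & FAIL_ONCE_FLAG) == 0); /* id must not already carry the flag */
    return id | FAIL_ONCE_FLAG;
}

/* Recover the failure id from a fail_loc value. */
static uint32_t fail_loc_id(uint32_t fail_loc)
{
    return fail_loc & ~FAIL_ONCE_FLAG;
}
```

Under that assumption, fail_loc_once(0x507) gives 0x80000507, the value the
test writes with sysctl, and fail_loc_id() maps it back to 0x507.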
There were also reports that the "fix" patch in attachment 9086 caused
acceptance-small.sh to crash. Did you have any similar problems when running
acceptance-small.sh with the fix in place?