On Wed, 2018-02-07 at 09:06 -0800, Tejun Heo wrote: > On Tue, Feb 06, 2018 at 05:11:33PM -0800, Bart Van Assche wrote: > > The following race can occur between the code that resets the timer > > and completion handling: > > - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate. > > - A completion occurs and blk_mq_complete_request() calls > > __blk_mq_complete_request(). > > - The timeout code calls blk_add_timer() and that function sets the > > request deadline and adjusts the timer. > > - __blk_mq_complete_request() frees the request tag. > > - The timer fires and the timeout handler gets called for a freed > > request. > > Can you see whether by any chance the following patch fixes the issue? > If not, can you share the repro case? > > Thanks. > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index df93102..651d18c 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool > reserved) > * ->aborted_gstate is set, this may lead to ignored > * completions and further spurious timeouts. > */ > - blk_mq_rq_update_aborted_gstate(req, 0); > blk_add_timer(req); > + blk_mq_rq_update_aborted_gstate(req, 0); > break; > case BLK_EH_NOT_HANDLED: > break;
Hello Tejun, Even with the above change I think that there is still a race between the code that handles timer resets and the completion handler. Anyway, the test with which I triggered these races is as follows: - Start from what will become kernel v4.16-rc1 and apply the patch that adds SRP over RoCE support to the ib_srpt driver. See also the "[PATCH v2 00/14] IB/srpt: Add RDMA/CM support" patch series (https://www.spinics.net/lists/linux-rdma/msg59589.html). - Apply my patch series that fixes a race between the SCSI error handler and SCSI transport recovery. - Apply my patch series that improves the stability of the SCSI target core (LIO). - Build and install that kernel. - Clone the following repository: https://github.com/bvanassche/srp-test. - Run the following test: while true; do srp-test/run_tests -c -t 02-mq; done - While the test is running, check whether or not something weird happens. Sometimes I see that scsi_times_out() crashes. Sometimes I see while running this test that a soft lockup is reported inside blk_mq_do_dispatch_ctx(). If you want I can share the tree on github that I use myself for my tests. Thanks, Bart.