On Wed, 2018-02-07 at 09:06 -0800, Tejun Heo wrote:
> On Tue, Feb 06, 2018 at 05:11:33PM -0800, Bart Van Assche wrote:
> > The following race can occur between the code that resets the timer
> > and completion handling:
> > - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
> > - A completion occurs and blk_mq_complete_request() calls
> >   __blk_mq_complete_request().
> > - The timeout code calls blk_add_timer() and that function sets the
> >   request deadline and adjusts the timer.
> > - __blk_mq_complete_request() frees the request tag.
> > - The timer fires and the timeout handler gets called for a freed
> >   request.
> Can you see whether by any chance the following patch fixes the issue?
> If not, can you share the repro case?
> Thanks.
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df93102..651d18c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -836,8 +836,8 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>                * ->aborted_gstate is set, this may lead to ignored
>                * completions and further spurious timeouts.
>                */
> -             blk_mq_rq_update_aborted_gstate(req, 0);
>               blk_add_timer(req);
> +             blk_mq_rq_update_aborted_gstate(req, 0);
>               break;
>       case BLK_EH_NOT_HANDLED:
>               break;

Hello Tejun,

Even with the above change I think that there is still a race between the
code that handles timer resets and the completion handler. Anyway, the test
with which I triggered these races is as follows:
- Start from what will become kernel v4.16-rc1 and apply the patch that adds
  SRP over RoCE support to the ib_srpt driver. See also the "[PATCH v2 00/14]
  IB/srpt: Add RDMA/CM support" patch series.
- Apply my patch series that fixes a race between the SCSI error handler and
  SCSI transport recovery.
- Apply my patch series that improves the stability of the SCSI target core.
- Build and install that kernel.
- Clone the following repository: https://github.com/bvanassche/srp-test.
- Run the following test:
  while true; do srp-test/run_tests -c -t 02-mq; done
- While the test is running, watch the kernel log for anomalies. Sometimes
  scsi_times_out() crashes and sometimes a soft lockup is reported inside
  blk_mq_do_dispatch_ctx().
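
A minimal sketch of how the loop above could be made to stop by itself once
the kernel log shows one of these symptoms. The function names, the grep
patterns and the failure.log file name are assumptions made for illustration;
none of them are part of srp-test. If the kernel panics outright, the dmesg
check is never reached, so a serial console or netconsole is still needed to
capture that case.

```shell
#!/bin/bash

# Hypothetical helper, not part of srp-test: succeeds if the kernel log text
# read from stdin mentions a soft lockup or has scsi_times_out in it (for
# example in a backtrace).
log_has_failure() {
    grep -E -q 'soft lockup|scsi_times_out'
}

# Hypothetical wrapper: repeat the 02-mq test until the kernel log shows one
# of the symptoms, then save the log. Clearing the log usually requires root.
run_until_failure() {
    dmesg -C
    while true; do
        srp-test/run_tests -c -t 02-mq
        if dmesg | log_has_failure; then
            dmesg > failure.log
            return 0
        fi
    done
}

# Only start the loop when explicitly asked to, e.g. RUN_LOOP=1 ./repro.sh
if [ "${RUN_LOOP:-0}" = 1 ]; then
    run_until_failure
fi
```

The loop is guarded by RUN_LOOP so that the file can be sourced to try
log_has_failure on a saved log without starting the test.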

If you want, I can share the GitHub tree that I use for my tests.



