On Tue, 2018-02-13 at 13:20 -0800, t...@kernel.org wrote:
> On Thu, Feb 08, 2018 at 04:31:43PM +0000, Bart Van Assche wrote:
> > The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. 
> > The
> > instruction at that address tries to dereference scsi_cmnd.device (%rax). 
> > The
> > register dump shows that that pointer has the value NULL. The only function 
> > I
> > know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The 
> > only
> > caller of that function in the SCSI core is scsi_initialize_rq(). That 
> > function
> > has two callers, namely scsi_init_command() and blk_get_request(). However,
> > the scsi_cmnd.device pointer is not cleared when a request finishes. This is
> > why I think that the above crash report indicates that scsi_times_out() was
> > called for a request that was being reinitialized and not by device 
> > hotplugging.
> 
> Can you please give the following patch a shot?  While timeout path is
> synchornizing against the completion path (and the following re-init)
> while taking back control of a timed-out request, it wasn't doing that
> while giving it back, so the timer registration could race against
> completion and re-issue.  I'm still not quite sure how that can lead
> to the oops tho.  Anyways, we need something like this one way or the
> other.
> 
> This isn't the final patch.  We should add batching-up of rcu
> synchronize calls similar to the abort path.
> 
> Thanks.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df93102..b66aec3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -816,7 +816,8 @@ struct blk_mq_timeout_data {
>       unsigned int nr_expired;
>  };
>  
> -static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> +static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request 
> *req,
> +                             bool reserved)
>  {
>       const struct blk_mq_ops *ops = req->q->mq_ops;
>       enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
> @@ -836,8 +837,12 @@ static void blk_mq_rq_timed_out(struct request *req, 
> bool reserved)
>                * ->aborted_gstate is set, this may lead to ignored
>                * completions and further spurious timeouts.
>                */
> -             blk_mq_rq_update_aborted_gstate(req, 0);
>               blk_add_timer(req);
> +             if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> +                     synchronize_rcu();
> +             else
> +                     synchronize_srcu(hctx->srcu);
> +             blk_mq_rq_update_aborted_gstate(req, 0);
>               break;
>       case BLK_EH_NOT_HANDLED:
>               break;
> @@ -893,7 +898,7 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx 
> *hctx,
>        */
>       if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
>           READ_ONCE(rq->gstate) == rq->aborted_gstate)
> -             blk_mq_rq_timed_out(rq, reserved);
> +             blk_mq_rq_timed_out(hctx, rq, reserved);
>  }
>  
>  static void blk_mq_timeout_work(struct work_struct *work)

Hello Tejun,

With this patch applied the tests I ran so far pass.

Thanks,

Bart.



Reply via email to