Hello, Bart.

Sorry about the delay.

On Thu, Feb 08, 2018 at 04:31:43PM +0000, Bart Van Assche wrote:
> The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The
> instruction at that address tries to dereference scsi_cmnd.device (%rax). The
> register dump shows that that pointer has the value NULL. The only function I
> know of that clears the scsi_cmnd.device pointer is scsi_req_init(). The only
> caller of that function in the SCSI core is scsi_initialize_rq(). That 
> function
> has two callers, namely scsi_init_command() and blk_get_request(). However,
> the scsi_cmnd.device pointer is not cleared when a request finishes. This is
> why I think that the above crash report indicates that scsi_times_out() was
> called for a request that was being reinitialized and not by device 
> hotplugging.

Can you please give the following patch a shot?  While timeout path is
synchornizing against the completion path (and the following re-init)
while taking back control of a timed-out request, it wasn't doing that
while giving it back, so the timer registration could race against
completion and re-issue.  I'm still not quite sure how that can lead
to the oops tho.  Anyways, we need something like this one way or the
other.

This isn't the final patch.  We should add batching-up of rcu
synchronize calls similar to the abort path.

Thanks.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102..b66aec3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -816,7 +816,8 @@ struct blk_mq_timeout_data {
        unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request 
*req,
+                               bool reserved)
 {
        const struct blk_mq_ops *ops = req->q->mq_ops;
        enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -836,8 +837,12 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
                 * ->aborted_gstate is set, this may lead to ignored
                 * completions and further spurious timeouts.
                 */
-               blk_mq_rq_update_aborted_gstate(req, 0);
                blk_add_timer(req);
+               if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+                       synchronize_rcu();
+               else
+                       synchronize_srcu(hctx->srcu);
+               blk_mq_rq_update_aborted_gstate(req, 0);
                break;
        case BLK_EH_NOT_HANDLED:
                break;
@@ -893,7 +898,7 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx 
*hctx,
         */
        if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
            READ_ONCE(rq->gstate) == rq->aborted_gstate)
-               blk_mq_rq_timed_out(rq, reserved);
+               blk_mq_rq_timed_out(hctx, rq, reserved);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)

Reply via email to