Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

[email protected] Wed, 11 Apr 2018 07:17:22 -0700

Hello, Bart.

On Wed, Apr 11, 2018 at 12:50:51PM +0000, Bart Van Assche wrote:
> Thank you for having shared this patch. It looks interesting to me. What I
> know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this patch
>   applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear when
>   the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
>   memory upon timeout"
>   (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
>   and also to a "RIP: scsi_times_out+0x17" crash during boot
>   (https://bugzilla.kernel.org/show_bug.cgi?id=199077).
> 
> So we have the choice between two approaches:
> (1) apply the patch from your previous e-mail and root-cause and fix the
>     crashes referred to above.
> (2) apply a patch that makes the crashes reported against v4.16 disappear and
>     remove the atomic instructions introduced by such a patch at a later time.
> 
> Since crashes have been reported for kernel v4.16 I think we should follow
> approach (2). That will remove the time pressure from root-causing and fixing
> the crashes reported for the NVMeOF initiator and SCSI initiator drivers.


So, it really bothers me how blindly we're going about this.  It isn't
an insurmountable emergency where we have to adopt whatever solution
happened to pass a couple of tests this minute.  We can debug and root
cause this properly and pick the right solution.  We even have the two
most likely causes already analysed and patches proposed, one of them
months ago.  If we wanna change the handover model, let's do that
because the new one is better, not because of vague fear.

Thanks.

-- 
tejun
