On Mon, 2018-04-09 at 09:47 -0700, Tejun Heo wrote:
> On Sun, Apr 08, 2018 at 10:20:38PM -0700, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs.
> Are we still talking about the same BLK_EH_RESET_TIMER case?  This can
> be solved by the two patches which rcu-synchronizes the hand-over to
> normal completion path, right?

Hello Tejun,

My opinion is not only that the two patches that you posted recently do not
fix all the races that are fixed by this patch but also that the races that
exist today in the blk-mq timeout handling code cannot be fixed completely
using RCU only.

> > Additionally, the blk-mq timeout handling code ignores completions that
> > occur after blk_mq_check_expired() has been called and before
> > blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
> > timeout handler always returns BLK_EH_RESET_TIMER then the result will
> > be that the request never terminates.
> And this is the same race window which was always there, right?  I
> really don't think reducing or closing this window requires full
> synchronization.

That race window did not exist in the legacy block layer. I don't think
we should tolerate such a race window in the blk-mq code either. If a request
does not get completed that leads to unkillable processes or hanging kernel
code. Such issues are hard to identify when reported by users. I think we
should fix these races before these cause more trouble to Linux users.



Reply via email to