On Mon, May 21, 2018 at 10:23:55AM -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 12:04:53AM +0800, Ming Lei wrote:
> > On Mon, May 21, 2018 at 09:44:33AM -0600, Keith Busch wrote:
> > > On Mon, May 21, 2018 at 11:34:27PM +0800, Ming Lei wrote:
> > > > nvme_dev_disable() quiesces the queues before killing them.
> > > > 
> > > > If the queues are quiesced during or before nvme_wait_freeze() is run
> > > > from the 2nd part of reset, the 2nd part can't make progress and an
> > > > IO hang results. After that, no new reset can be scheduled at all.
> > > 
> > > But this patch moves nvme_wait_freeze outside the reset path, so I'm
> > > afraid I'm unable to follow how you've concluded the wait freeze is
> > > somehow part of the reset.
> > 
> > For example:
> > 
> > 1) the 1st timeout event:
> > 
> > - nvme_dev_disable()
> > - reset
> > - scan_work
> > 
> > 2) the 2nd timeout event:
> > 
> > nvme_dev_disable() may come just after nvme_start_queues() in the
> > reset triggered by the 1st timeout above, and nvme_timeout() won't
> > schedule a new reset since the controller state is NVME_CTRL_CONNECTING.
> 
> Let me get this straight -- you're saying nvme_start_queues is going
> to somehow immediately trigger timeout work? I can't see how that could

It may be difficult to trigger in reality, but it isn't impossible,
since the timeout value can be adjusted via a module parameter and
arbitrary scheduling delays can occur on a busy system.

It isn't good practice to rely on timing to avoid a race, IMO.
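
To make the "tunable timeout" point concrete, below is a minimal sketch
(the module name, variable and values are made up for illustration, not
taken from the nvme driver) of how such a timeout is typically exposed
as a writable module parameter, which is why the window above can be
shrunk at will when testing:

#include <linux/module.h>
#include <linux/moduleparam.h>

/* default of 30 seconds, similar to the usual I/O timeout default */
static unsigned int io_timeout = 30;
module_param(io_timeout, uint, 0644);
MODULE_PARM_DESC(io_timeout, "timeout in seconds for I/O");

static int __init demo_init(void)
{
	/* a low value, e.g. demo.io_timeout=1, makes the window easy to hit */
	pr_info("demo: io_timeout=%u seconds\n", io_timeout);
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

With the parameter created as 0644, the value can also be lowered at
run time through /sys/module/demo/parameters/io_timeout, so the timing
assumption doesn't hold even without rebuilding or reloading anything.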

> possibly happen in real life, but we can just remove it and use the existing
> nvme_start_ctrl to handle that in the LIVE state.

OK, please also fix the other issues mentioned in the comment below,
then I will review further and see whether it can work well.

https://marc.info/?l=linux-block&m=152668465710591&w=2

Thanks,
Ming
