Hey, sorry for the delay.
I tested your patch and it does work. Do you want me to send your
change as a full patch? Can I add your Signed-off-by?

On 2019-07-18 6:50 p.m., Sagi Grimberg wrote:
>> I didn't think the scan_lock was that contested or that
>> nvme_change_ctrl_state() was really called that often...
>
> it shouldn't be, but I think it makes the flow more convoluted
> as we serialize by flushing the scan_work right after...

I would argue that checking the state in nvme_scan_work() without a
lock is racy and confusing: nothing prevents the state from changing
immediately after the check.

> The design principal is met as we do get the I/O failing,
> but its just that with mpath we simply queue the I/O again
> because the head->list happens to not be empty.
> Perhaps taking care of that check is cleaner.

Yes, I feel your patch is a good solution on its own merits.

> Thanks. Do you have a firm reproducer for it?

Yes. If you connect to and then immediately disconnect from a target
(at least with nvme-loop), you will reliably trigger this bug -- or
one of the others I've sent patches for.

>>>> +	mutex_lock(&ctrl->scan_lock);
>>>> +
>>>>  	if (ctrl->state != NVME_CTRL_LIVE)
>>>>  		return;
>>>
>>> unlock
>>
>> If we unlock here and relock below, we'd have to recheck the ctrl->state
>> to avoid any races. If you don't want to call nvme_identify_ctrl with
>> the lock held, then it would probably be better to move the state check
>> below it.
>
> Meant before the return statement.

Ah, right, my mistake.

Logan
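
P.S. For anyone skimming the thread, here's a minimal, self-contained
userspace sketch of the pattern we're debating. It is not the nvme
code itself: struct ctrl, the scan_work_* functions, and CTRL_LIVE are
made up, and a pthread mutex stands in for the kernel mutex. It's only
meant to show why the unlocked state check can race, and where the
unlock-before-return that Sagi pointed out goes.

#include <pthread.h>
#include <stdio.h>

enum ctrl_state { CTRL_LIVE, CTRL_DELETING };

struct ctrl {
	pthread_mutex_t scan_lock;
	enum ctrl_state state;
};

/* Racy: another thread can move state to CTRL_DELETING between the
 * check and the scan, so the check proves nothing by scan time. */
static void scan_work_racy(struct ctrl *c)
{
	if (c->state != CTRL_LIVE)
		return;
	/* window: state may change right here, before we scan */
	puts("scanning (racy)");
}

/* Serialized: a state change made under scan_lock cannot slip in
 * between the check and the scan. */
static void scan_work_locked(struct ctrl *c)
{
	pthread_mutex_lock(&c->scan_lock);
	if (c->state != CTRL_LIVE) {
		pthread_mutex_unlock(&c->scan_lock); /* unlock before return */
		return;
	}
	puts("scanning (state known LIVE under scan_lock)");
	pthread_mutex_unlock(&c->scan_lock);
}

int main(void)
{
	struct ctrl c = { PTHREAD_MUTEX_INITIALIZER, CTRL_LIVE };

	scan_work_racy(&c);
	scan_work_locked(&c);
	return 0;
}

Either way, the underlying point is the same one made above: the state
check and the work it gates have to happen under the same lock, or the
check is just a hint.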

