On 8/29/22 3:00 PM, Uday Shankar wrote:
>> I hit the hang below and it should be fixed in this set:
>>
>> https://urldefense.com/v3/__https://lore.kernel.org/all/[email protected]/__;!!ACWV5N9M2RV99hQ!L10OSfsVJ6Bssm8eT4oRSTBOLfCKp7-4cM5kGDF4aXnYuBfB0xlTEzbFTvMMw0nqmbfLqSpaXVl-E_tVvIc6ZQqBrA$
>>
>
> I hit the hang I described while running a kernel with your fix in. My
Ah yeah sorry. I didn't read your first mail well. I tried to do a
quick reply from my phone on vacation :)
> hang does not involve scsi-eh; it happens when a transport error is
> detected by the kernel while iscsid is in the process of handling a
> previous transport error. If the second transport error happens at an
> unlucky timing (after iscsid does start_conn when handling the first> error,
> but before it sets the device state to running via sysfs), the
> device queues can be quiesced (in the block layer sense) when iscsid
> writes to sysfs and ends up in scsi_rescan_device.
Ok got it.
>
> I wanted to avoid this by just removing the write to sysfs from iscsid,
> but it seems that interferes with scsi-eh flow:
>
>> The user space online code in your patch handles the scsi-eh case.
>
> So I will need to try another approach. Simply adding a call to
> scsi_start_queue like so
>
> if (rescan_dev) {
> /* ...
> */
> + scsi_start_queue(sdev);
> blk_mq_run_hw_queues(sdev->request_queue, true);
> scsi_rescan_device(dev);
> }
>
> does not fix the issue either, since your linked patch moves the
> run_hw_queues and rescan calls out of the area protected by state_mutex.
> A race window still exists where the kernel could detect the second
> transport error and quiesce sdev's request queue after we unquiesce it
> but before we run the queues and rescan the device above.
>
> It seems that the most straightforward way to fix my issue without
> undoing your patch is to add another sync primitive that guards the sdev
> queue quiesced/unquiesced state so that we can guarantee the queues are
> unquiesced when we do the rescan above. Unless you have other
> suggestions, I'll try that approach.
>
Here is an alternative I was thinking about when I did that other
patchset.
I think we could remove the session_online_devs call for newer kernels.
The session_online_devs call was handling the case where a scsi cmd
timesout because the network is lost, we run the scsi eh, that ends
up getting escalated to us dropping the session, then iscsi_eh_session_reset
fails due to the replacement_timeout expiring.
With the patch:
commit a0c2f8b6709a9a4af175497ca65f93804f57b248
Author: Mike Christie <[email protected]>
Date: Fri Nov 5 17:10:47 2021 -0500
scsi: iscsi: Unblock session then wake up error handler
we should always end up in the transport-offline state for that
type of case. When we relogin, then the conn start's call to
__iscsi_unblock_session -> scsi_target_unblock will handle that
case and set the device to running. So we should not need the
userspace online/running call anymore.
So we could just add a CAP_SCSI_EH_TRANSPORT OFFLINE flag to
the iscsi_transport->caps struct. When userspace sees that then
it knows the kernel will now do the right thing.
The drawback is that we have to patch userspace and then also
get the the new CAP_SCSI_EH_TRANSPORT_OFFLINE patch in the stable
kernels.
--
You received this message because you are subscribed to the Google Groups
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/open-iscsi/644defe5-2a33-9eda-ff7f-f600e635a48b%40oracle.com.