Erez Zilber wrote:
> Mike,
> 
> I'm trying to debug a problem that we have with iscsiadm: I'm running
> open-iscsi against multiple targets. At some point, I'm closing the
> connection from one of the targets (i.e. on the target side). Then, I
> try to logout from the initiator side, but something goes wrong. The
> last thing that iscsiadm does is call recv from iscsid_response and it
> doesn't return (at least not after 10 minutes). I also see that in the
> kernel, __iscsi_unbind_session calls scsi_remove_target and doesn't
> return. I guess that this causes iscsiadm to wait on the recv call.

Yeah, iscsiadm will wait for iscsid operations like the unbind to
complete, and that can take a while.

If you stop the target and we then start the session shutdown process
while we still think the session is up (we have not gotten a tcp
connection error, a rst, or any other indication that something is bad,
like a nop timing out), then we are going to end up firing the iscsi or
scsi eh.

If you have IO running or if your LU requires a cache sync to be sent
when shutting it down, then the worst case is that you have nops turned
off and, for some reason, the network layer does not return an error (it
just returns something we think is retryable, like EAGAIN) when we try
to do sendpage/sendmsg. This will result in the scsi commands timing
out. Then the aborts and other tmfs will time out, and then we will wait
for replacement_timeout seconds to try to reconnect.

If you have nops on or the net layer returns an error, it would be a
little faster because you do not have to wait for scsi commands to time
out. The nop will time out after noop_timeout seconds, then we will wait
for replacement_timeout seconds to reconnect. After that time we will
fail everything.

If you do not have IO running and your device does not require cache
syncs, then it should be a lot shorter, but it still may take a minute.
The __iscsi_unbind_session/scsi_remove_target should complete quickly
since they do not have to wait on IO and cache syncs to complete. We
would just wait for the logout iscsi pdu to time out.
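To make the three cases above easier to compare, here is a rough sketch
that adds up the timeouts for each one. The function name, the parameter
names, and all of the default values are made up for illustration (they
are not read from iscsid or the kernel), so plug in the values from your
own setup. It also ignores the case where the net layer returns an error
right away.

```python
def worst_case_logout_wait(nops_enabled, io_or_cache_sync,
                           scsi_cmd_timeout=30, tmf_timeout=45,
                           noop_timeout=5, replacement_timeout=120,
                           logout_timeout=15):
    """Rough upper bound, in seconds, on how long a logout may block.

    All defaults are illustrative assumptions, not values taken from
    open-iscsi or the kernel.
    """
    if not io_or_cache_sync:
        # No IO and no cache sync: only the logout pdu has to time out.
        return logout_timeout
    if nops_enabled:
        # The nop times out first, then we wait to try to reconnect.
        return noop_timeout + replacement_timeout
    # Nops off and the net layer keeps returning retryable errors:
    # scsi commands time out, then aborts/tmfs, then the reconnect wait.
    return scsi_cmd_timeout + tmf_timeout + replacement_timeout


if __name__ == "__main__":
    for nops, io in ((False, True), (True, True), (True, False)):
        print(nops, io, worst_case_logout_wait(nops, io))
```

With these assumed numbers the nops-off case is the longest, which
matches the ordering described above.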


There is also a bug where we retry the sending of data even though we
know the connection is bad. This patch helps
http://git.kernel.org/?p=linux/kernel/git/mnc/linux-2.6-iscsi.git;a=commit;h=b138adb2df49967bf0a035143f734d33c4263963
but what we want is to be able to break from the sendpage/sendmsg wait.
I am working on a patch, but have hit some problems (for some reason,
if I send a signal it does not break from the wait). This problem only
adds maybe 30 seconds extra for the logout of a session, so I am not
sure that is what you are hitting.



So first check if your device needs a cache sync. You can check that by
looking at /var/log/messages when the device is discovered. You will see
something like:

kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

If write cache is enabled then the scsi layer will send cache syncs.
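
If you have a lot of devices, something like the sketch below could pull
the write cache state out of the log text. The regular expression is
written against the sample line above only, so it is an assumption that
your kernel prints the same format; the function name is made up.

```python
import re

# Matches e.g. "[sda] Write cache: enabled" in an sd log line.
WRITE_CACHE_RE = re.compile(
    r"\[(?P<dev>\w+)\]\s+Write cache:\s+(?P<state>enabled|disabled)"
)


def write_cache_states(log_text):
    """Map device name -> True if the log reports write cache enabled."""
    states = {}
    for match in WRITE_CACHE_RE.finditer(log_text):
        states[match.group("dev")] = match.group("state") == "enabled"
    return states


sample = ("kernel: sd 0:0:0:0: [sda] Write cache: enabled, "
          "read cache: enabled, doesn't support DPO or FUA")

if __name__ == "__main__":
    print(write_cache_states(sample))
```

Any device that maps to True here will have cache syncs sent to it on
shutdown.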

Then check your replacement_timeout. If that is really long, then we 
might be hitting that.
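
A small sketch for reading the value out of the config text, assuming
the setting is spelled node.session.timeo.replacement_timeout as in
iscsid.conf. The function name and the sample text are made up for
illustration, and the value in a node record for an already discovered
target may differ from the one in iscsid.conf.

```python
KEY = "node.session.timeo.replacement_timeout"


def read_replacement_timeout(conf_text):
    """Return the replacement timeout in seconds, or None if not set."""
    for line in conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        if name == KEY:
            return int(value)
    return None


sample_conf = """
# hypothetical excerpt, values are examples only
node.session.timeo.replacement_timeout = 120
"""

if __name__ == "__main__":
    print(read_replacement_timeout(sample_conf))
```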




> 
> BTW - I'm not running with the latest code. My HEAD is commit
> ef0357c4728ebba1a4b91a7f6d69c729a5f9e6e3. I don't know if any relevant
> bug fixes were made lately.



Just so you know, I normally work on linux-2.6-iscsi, which tracks 
upstream, then port to open-iscsi/kernel, so the newest kernel patches 
will be in there.
