Re: SUSE open-iscsi bug on replacement_timeout [resend]

Lee Duncan Thu, 17 Jan 2013 18:18:15 -0800

On Jan 15, 2013, at 10:23 PM, Mike Christie <[email protected]> wrote:


> On 01/15/2013 08:26 PM, William L Duncan wrote:
>> Hi Mike:
>> 
>> I am working on a somewhat-old SUSE bug that you may remember,
>> since you commented on it, though it's been 18 months. For reference:
>> 
>> https://bugzilla.novell.com/show_bug.cgi?id=645616
>> 
>> This bug has to do with a user having a problem when running Veritas
>> on top of an open-iscsi volume. In SLES 10 SP2 they say that when one
>> of their two paths to the target failed, iSCSI took 120 seconds
>> (replacement_timeout) before failing, giving the back-end controllers
>> time to failover, but in SP4, they found that when one of their
>> back-end controllers went out that iSCSI failed over immediately.
> 
> 
> When that happened did you see iscsid kill the sessions instead of
> retrying the login? If this happened then you would no longer see it in
> /sys/class/iscsi_session and iscsiadm -m session would no longer report
> info for it.
> 
> Or does the session still exist in kernel and in iscsid?

The session is still there:

root@sles10vm(1):/tmp# iscsiadm -m session
tcp: [3] 192.168.20.2:3260,1 iqn.2001-04.net.gonzoleeman:test.disk.laptop.001


> 
>> 
>> It looks like SLES 10 SP4 uses open-iscsi-2.0.865 with a few patches.
>> 
>> I am using iscsitarget, and I am stopping my target to simulate this
>> condition, where the replacement_timeout should be triggered. Is this
>> the correct method?
>> 
> 
> Yeah, that should trigger it. Are you seeing IO failed with DID_BUS_BUSY
> too like in the Novell bugzilla?

I never saw "DID_BUS_BUSY" in the logs attached to the original bug report,
and I don't see any such message in /var/log/messages now. I know my kernel
has the "DID_BUS_BUSY" code present, but I just don't know how to tell if
anything is returning that or not.
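
One way to check, sketched here rather than taken from the logs: if the kernel prints a 32-bit SCSI result word for the failed I/O, the host byte sits in bits 16-23 and can be decoded by hand. The values below are the ones I understand include/scsi/scsi.h to define upstream; DID_TRANSPORT_DISRUPTED and DID_TRANSPORT_FAILFAST come from the later upstream work and may not exist in a 2.6.16-based tree. The helper names are made up for illustration, and the exact log line that carries the result word varies by kernel version, so this only decodes a number you have already found.

```python
# Host-byte values as defined upstream in include/scsi/scsi.h (partial list).
# Assumption: the kernel in use has the same numbering; the last two entries
# were added upstream later and may be missing from an older tree.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x07: "DID_ERROR",
    0x0E: "DID_TRANSPORT_DISRUPTED",
    0x0F: "DID_TRANSPORT_FAILFAST",
}


def host_byte(result):
    """Return the host byte (bits 16-23) of a 32-bit SCSI result word."""
    return (result >> 16) & 0xFF


def host_byte_name(result):
    """Map a SCSI result word to a DID_* name, or a hex string if unknown."""
    code = host_byte(result)
    return HOST_BYTE_NAMES.get(code, "unknown host byte 0x%02x" % code)


if __name__ == "__main__":
    # A result word whose host byte is 0x02.
    print(host_byte_name(0x00020000))
```

If no result word shows up in /var/log/messages at all, this does not help, and the logging itself would have to be turned up first.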

> 
> 
>> I tried applying the patch from git hash
>> 1f1641b2c92df43895367296785fe8e4e9f96273
>> "iscsid: fix relogin retry handling", but it does not help.
>> 
>> When I trigger this condition (using "-d 8"), it seems to go into a
>> retry-forever loop, in state
>> "iscsid: login failed STATE_XPT_WAIT/R_STAGE_SESSION_REOPEN 257".
>> The only thing that changes is the retry count, which seems to increase
>> without bounds. This explains why the patch I tried does not help, since
>> that patch does not modify handling of this state.
>> 
>> I would try a newer open-iscsi, but I read on the mailing list about possible
>> retry problems with 2.0.870, so I thought I'd ask your opinion before I try
>> that.
>> 
>> I'd be glad to supply the log file, but it's awfully large ... Any ideas you
>> have would be most appreciated.
>> 
> 
> 
> Do you have a log with the iscsid -d 8 output and the kernel output? If
> so put it in a place where I can download it.
> 


The compressed log file is only about 8k, so I will email it to you directly, as
an attachment. If that doesn't work I can find some place to ftp it, if needed.

Note that I ran "iscsiadm -m session -P 2" before and during this problem, and
the diffs are minimal:

15,16c15,16
<               SID: 1
<               iSCSI Connection State: LOGGED IN
---
>              SID: 2
>              iSCSI Connection State: TRANSPORT WAIT
18c18
<               Internal iscsid Session State: NO CHANGE
---
>              Internal iscsid Session State: REPOEN


> Could you also send me the open-iscsi and iscsi kernel source you are using.

I am using the latest kernel sources for SLES 10 SP4. This is based on
a 2.6.16.60 kernel, with patches, including 2 of the 3 you suggested
(see below).

The open-iscsi is version 2.0.865, with quite a few patches -- almost 60
of them.

You can actually see exactly how the code looks if you pull from Hannes'
open-iscsi SUSE public repository, available at:

git://github.com/hreinecke/open-iscsi.git

under the sles10-sp4 branch (of course).

NOTE: in the previously-mentioned bug report, you suggested to Hannes that
we needed 3 patches:

> VxDMP sets fast fail, right? I think the driver is returning with
> DID_BUS_BUSY, then this hits the fast fail check in scsi_error.c.
> 
> Upstream I added the DID_TRANSPORT_DISRUPTED use in libiscsi.c and then added
> handling in scsi_error.c. Upstream commits:
> 
> 
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=4a27446f3e39b06c28d1c8e31d33a5340826ed5c
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=a4dfaa6f2e55b736adf2719133996f7e7dc309bc
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=56d7fcfa815564b40a1b0ec7a30ea8cb3bc0713e

And Hannes replied that all were present. It turns out that the first one,
hash=4a27446f3e, actually is not present, though the other two are. This is
the patch that added the function scsi_noretry_cmd() to scsi_error.c (and
its use). I will try adding this one, since at a glance it may be relevant.
-- 
Lee Duncan

"A witty saying proves nothing." -- Voltaire


