On Jan 15, 2013, at 10:23 PM, Mike Christie <[email protected]> wrote:
> On 01/15/2013 08:26 PM, William L Duncan wrote:
>> Hi Mike:
>>
>> I am working on a somewhat-old SUSE bug that you may remember,
>> since you commented on it, though it's been 18 months. For reference:
>>
>> https://bugzilla.novell.com/show_bug.cgi?id=645616
>>
>> This bug has to do with a user having a problem when running Veritas
>> on top of an open-iscsi volume. In SLES 10 SP2 they say that when one
>> of their two paths to the target failed, iSCSI took 120 seconds
>> (replacement_timeout) before failing, giving the back-end controllers
>> time to failover, but in SP4, they found that when one of their
>> back-end controllers went out that iSCSI failed over immediately.
>
> When that happened did you see iscsid kill the sessions instead of
> retrying the login? If this happened then you would no longer see it in
> sys/class/iscsi_sessions and iscsiadm -m session would no longer report
> info for it.
>
> Or does the session still exist in kernel and in iscsid?

The session is still there:

root@sles10vm(1):/tmp# iscsiadm -m session
tcp: [3] 192.168.20.2:3260,1 iqn.2001-04.net.gonzoleeman:test.disk.laptop.001

>> It looks like SLES 10 SP4 uses open-iscsi-2.0.865 with a few patches.
>>
>> I am using iscsitarget, and I am stopping my target to simulate this
>> condition, where the replacement_timeout should be triggered. Is this
>> the correct method?
>
> Yeah, that should trigger it. Are you seeing IO failed with DID_BUS_BUSY
> too like in the Novell bugzilla?

I never saw "DID_BUS_BUSY" in the logs attached to the original bug report,
and I don't see any such message in /var/log/messages now. I know my kernel
has the "DID_BUS_BUSY" code present, but I just don't know how to tell if
anything is returning that or not.

>> I tried applying a patch from git hash
>> 1f1641b2c92df43895367296785fe8e4e9f96273
>> "iscsid: fix relogin retry handling", but to no avail, as that does not help.
>>
>> When I trigger this condition (using "-d 8"), it seems to go into a
>> retry-forever loop, in state
>> "iscsid: login failed STATE_XPT_WAIT/R_STAGE_SESSION_REOPEN 257".
>> The only thing that changes is the retry count, which seems to increase
>> without bounds. This explains why the patch I tried does not help, since
>> that patch does not modify handling of this state.
>>
>> I would try a newer open-iscsi, but I read on the mailing list about
>> possible retry problems with 2.0.870, so I thought I'd ask your opinion
>> before I try that.
>>
>> I'd be glad to supply the log file, but it's awfully large ... Any ideas
>> you have would be most appreciated.
>
> Do you have a log with the iscsid -d 8 output and the kernel output? If
> so put it in a place where I can download it.

The compressed log file is only about 8k, so I will email it to you directly,
as an attachment. If that doesn't work I can find some place to ftp it, if
needed.

Note that I ran "iscsiadm -m session -P 2" before and during this problem,
and the diffs are minimal:

15,16c15,16
< SID: 1
< iSCSI Connection State: LOGGED IN
---
> SID: 2
> iSCSI Connection State: TRANSPORT WAIT
18c18
< Internal iscsid Session State: NO CHANGE
---
> Internal iscsid Session State: REPOEN

> Could you also send me the open-iscsi and iscsi kernel source you are using.

I am using the latest kernel sources for SLES 10 SP4. This is based on a
2.6.16.60 kernel, with patches, including 2 of the 3 you suggested (see
below).

The open-iscsi is version 2.0.865, with quite a few patches -- almost 60 of
them. You can actually see exactly how the code looks if you pull from
Hannes' open-iscsi SUSE public repository, available at:

    git://github.com/hreinecke/open-iscsi.git

under the sles10-sp4 branch (of course).

NOTE: in the previously-mentioned bug report, you suggested to Hannes that
we needed 3 patches:

> VxDMP sets fast fail, right?
> I think the driver is returning with DID_BUS_BUSY,
> then this hits the fast fail check in scsi_error.c.
>
> Upstream I added the DID_TRANSPORT_DISRUPTED use in libiscsi.c and then added
> handling in scsi_error.c. Upstream commits:
>
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=4a27446f3e39b06c28d1c8e31d33a5340826ed5c
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=a4dfaa6f2e55b736adf2719133996f7e7dc309bc
> http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=56d7fcfa815564b40a1b0ec7a30ea8cb3bc0713e

And Hannes replied that all were present. It turns out that the first one,
hash=4a27446f3e, actually is not present, though the other two are. This is
the patch that added the function scsi_noretry_cmd() to scsi_error.c (and
its use). I will try adding this one, since at a glance it may be relevant.

--
Lee Duncan

"A witty saying proves nothing." -- Voltaire
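
For comparing the two "iscsiadm -m session -P 2" captures mentioned above
without reading the diff by eye, something like the following could work.
It is only a sketch: the function names are made up for illustration, it
parses saved output text rather than running iscsiadm itself, and it assumes
each session's block lists "SID:" before the two state lines, as the diff
line numbers suggest.

```python
import re

# Field labels copied from the "iscsiadm -m session -P 2" diff in this thread.
FIELDS = (
    "SID",
    "iSCSI Connection State",
    "Internal iscsid Session State",
)


def session_states(report):
    """Return one dict per session found in saved -P 2 output text.

    Assumption: a "SID:" line starts each session's block of state fields.
    """
    sessions = []
    current = None
    for line in report.splitlines():
        match = re.match(r"\s*([^:]+):\s*(.*)$", line)
        if not match:
            continue
        key = match.group(1).strip()
        value = match.group(2).strip()
        if key not in FIELDS:
            continue  # ignore Target, Current Portal, and so on
        if key == "SID":
            current = {}
            sessions.append(current)
        if current is not None:
            current[key] = value
    return sessions


def stuck_sessions(report):
    """Sessions that report a connection state other than LOGGED IN."""
    label = "iSCSI Connection State"
    return [s for s in session_states(report)
            if label in s and s[label] != "LOGGED IN"]
```

Feeding it the "during" capture should flag the session sitting in
TRANSPORT WAIT, while the "before" capture should come back empty.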
