Hi Mike: I think I may have figured out my problem, at least partly ... answers below.
On Jan 23, 2013, at 1:58 PM, Mike Christie <[email protected]> wrote:

> On 01/17/2013 07:18 PM, Lee Duncan wrote:
>>>> Yeah, that should trigger it. Are you seeing IO failed with DID_BUS_BUSY
>>>> too like in the Novell bugzilla?
>>
>> I never saw "DID_BUS_BUSY" in the logs attached to the original bug report,
>> and I don't see any such message in /var/log/messages now. I know my kernel
>> has the "DID_BUS_BUSY" code present, but I just don't know how to tell if
>> anything is returning that or not.
>
> In older SLES kernels we just got the hex value of the errors variable. So:
>
>   Oct 3 19:34:10 IBMx3250-200-174 kernel: sd 1:0:0:0: SCSI error: return
>   code = 0x00020000
>
> is DID_BUS_BUSY right?

Yes, of course you are correct. :) I was looking for the string itself. (Duh!)

> In the log you sent I do not see any errors except:
>
>   Jan 16 13:02:49 sles10vm kernel: sd 2:0:0:0: timing out command, waited 60s
>
> This sort of makes sense because it looks like the failure has lasted a
> minute (start of error below):
>
>   Jan 16 13:01:49 sles10vm kernel: connection2:0: iscsi: detected conn
>   error (1011)
>
> What does not make sense is why that command is floating around getting
> retried and hitting that time check. In the upstream code the IO is in a
> blocked queue, so it should not hit that check while blocked. While the
> replacement/recovery timeout has not expired, it should just be sitting in
> the queue and never reach that timed-out-command code path.
>
> When you run the test, before IO is failed with that error, is the iscsi
> device in the state "blocked"? You can run "iscsiadm -m session -P 3" to
> see the states.

Thank you for this wonderful clue! I should have been looking at the state of
the session during this error myself. [Note that I also changed my I/O for
this test from sg_io to regular (non-buffered) reads, as you suggested in
another email to me.]
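[Aside: here is a small sketch of the kind of filtering I ended up doing on the "-P 3" output. The sample text below is a made-up excerpt standing in for the live command, just to show the grep; on a real system you would pipe `iscsiadm -m session -P 3` in instead.]

```shell
#!/bin/sh
# Hypothetical excerpt of "iscsiadm -m session -P 3" output; on a live
# system, replace the here-string with the real command, i.e.:
#   iscsiadm -m session -P 3 | grep -E 'Connection State|scsi disk'
sample='iSCSI Connection State: TRANSPORT WAIT
Internal iscsid Session State: REOPEN
Attached scsi disk sdb State: blocked'

# Keep only the connection state and the attached-disk state lines.
printf '%s\n' "$sample" | grep -E 'Connection State|scsi disk'
```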
As to the iscsi state during my test: instead of showing the "-P 3" output for
before and after, what was more interesting was the diff between the two
states:

  zsh# diff *during*
  1,2c1,2
  < [During I/O]
  < ============
  ---
  > [During Failure]
  > ================
  18c18
  < iSCSI Connection State: LOGGED IN
  ---
  > iSCSI Connection State: TRANSPORT WAIT
  20c20
  < Internal iscsid Session State: NO CHANGE
  ---
  > Internal iscsid Session State: REPOEN
  24c24
  < HeaderDigest: CRC32C
  ---
  > HeaderDigest: None
  38c38
  < Attached scsi disk sdb State: running
  ---
  > Attached scsi disk sdb State: blocked

Notice the HeaderDigest ... I could not understand (1) why HeaderDigest was
enabled during normal I/O, and (2) why it was then not set during recovery.
So the google-sphere helped me find this patch:

> commit 1bcd3db15d847c712e0566fb0a9e4edb195e198a
> Author: Mike Christie <[email protected]>
> Date:   Sun Apr 20 05:19:49 2008 -0500
>
>     set header digests to off by default
>
>     Signed-off-by: Mike Christie <[email protected]>

Note that my iscsitarget software had defined my iSCSI target with:

> Target iqn.2001-04.net.gonzoleeman:test.disk.laptop.001
>     Lun 0 Path=/alt2/big_disc_file.img,Type=fileio
>     HeaderDigest None,CRC32C
>     DataDigest None,CRC32C

The pre-patched open-iscsi code, together with this target definition, was
negotiating CRC32C for the HeaderDigest. After applying the patch above to my
SLES 10 SP4 source base, the problem does not occur.

Now, when I shut down my iSCSI target during I/O, I get what seems to me
better behavior. Here's the iscsid output from /var/log/messages:

  ...
  Jan 29 09:43:51 sles10vm iscsid: connection5:0 is operational now
  => Tue Jan 29 09:49:23 PST 2013: starting I/O ...
  => Tue Jan 29 09:49:59 PST 2013: killing target ...
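[Aside: for anyone following along who cannot patch their source base, I believe the same default can be set through open-iscsi's own config rather than in code -- this is a sketch, and the exact file path and per-node override syntax may vary by distro/version:]

```
# /etc/iscsi/iscsid.conf (sketch; path/syntax may vary by distro)
# Offer only "None" for header digests so CRC32C is never negotiated,
# which is what the patch above makes the default:
node.conn[0].iscsi.HeaderDigest = None
```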
  Jan 29 09:49:59 sles10vm kernel: connection5:0: iscsi: detected conn error (1011)
  Jan 29 09:50:00 sles10vm iscsid: Kernel reported iSCSI connection 5:0 error (1011) state (3)
  Jan 29 09:50:02 sles10vm iscsid: connect failed (111)
  Jan 29 09:50:36 sles10vm last message repeated 9 times
  Jan 29 09:51:40 sles10vm last message repeated 17 times
  Jan 29 09:51:55 sles10vm last message repeated 4 times
  Jan 29 09:51:59 sles10vm kernel: session5: iscsi: session recovery timed out after 120 secs
  Jan 29 09:51:59 sles10vm iscsid: connect failed (111)
  => Tue Jan 29 09:52:29 PST 2013: done with test
  Jan 29 09:52:29 sles10vm last message repeated 8 times
  Jan 29 09:52:55 sles10vm last message repeated 7 times
  ...

As you can see, I *am* hitting the recovery timeout now, which was the
original problem, now fixed. The only worry I have is that even though the
target never comes back, I continue to get what appear to be recovery
attempts, as witnessed by the continued "iscsid: connect failed (111)"
messages. Is that normal?

As always, thanks for your help with this issue!
--
The Lee-Man

Prepare to be unprepared.
    -- Me
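P.S. For anyone curious about the "(111)" in those messages: it looks like a raw errno value, and on Linux errno 111 is ECONNREFUSED, i.e. the connect retries reach the dead target's address but nothing is listening there. A quick way to decode it from the shell (assuming python3 is handy):

```shell
#!/bin/sh
# Decode errno 111 ("connect failed (111)" from iscsid) to its symbolic
# name and message; on Linux this is ECONNREFUSED.
python3 -c 'import errno, os; print(errno.errorcode[111], "=", os.strerror(111))'
# prints: ECONNREFUSED = Connection refused
```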
