Hi Mike:

I think I may have figured out my problem, at least partly  ... answers below.

On Jan 23, 2013, at 1:58 PM, Mike Christie <[email protected]> wrote:

> On 01/17/2013 07:18 PM, Lee Duncan wrote:
>>>> Yeah, that should trigger it. Are you seeing IO failed with DID_BUS_BUSY
>>>> too, like in the Novell bugzilla?
>> I never saw "DID_BUS_BUSY" in the logs attached to the original bug report,
>> and I don't see any such message in /var/log/messages now. I know my kernel
>> has the "DID_BUS_BUSY" code present, but I just don't know how to tell if
>> anything is returning that or not.
>> 
> 
> In older SLES kernels we just got the hex value of the errors variable. So:
> 
> Oct  3 19:34:10 IBMx3250-200-174 kernel: sd 1:0:0:0: SCSI error: return
> code = 0x00020000
> 
> is DID_BUS_BUSY right?

Yes, of course you are correct. :) I was looking for the string itself. (Duh!)
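In case it helps anyone else following along: the return code packs the driver, host, msg, and status bytes together, so it's the host byte that identifies DID_BUS_BUSY (0x02). A quick shell sketch of the decode:

```shell
# SCSI result encoding: driver<<24 | host<<16 | msg<<8 | status.
# Host byte 0x02 is DID_BUS_BUSY (see include/scsi/scsi.h).
rc=0x00020000
host_byte=$(( (rc >> 16) & 0xff ))
printf 'host byte: 0x%02x\n' "$host_byte"   # prints "host byte: 0x02"
```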

> 
> 
> In the log you sent I do not see any errors except:
> 
> Jan 16 13:02:49 sles10vm kernel: sd 2:0:0:0: timing out command, waited 60s
> 
> This sort of makes sense because it looks like the failure has lasted a
> minute (start of error below):
> 
> Jan 16 13:01:49 sles10vm kernel:  connection2:0: iscsi: detected conn error (1011)
> 
> What does not make sense is why that command is floating around getting
> retried and hitting that time check. In the upstream code the IO is in a
> blocked queue so it should not hit that check while blocked. While the
> replacement/recovery timeout has not expired, it should just be
> sitting in the queue and not hit that timed-out-command code path.
> 
> When you run the test, before IO is failed with that error, is the iscsi
> device in the state "blocked"? You can run iscsiadm -m session -P 3 to
> see the states.

Thank you for this wonderful clue! I should have been looking at the state
of the session during this error myself.

[Note that I also changed my I/O for this test from sg_io to regular
(non-buffered) reads, as you suggested in another email to me.]

As to the iSCSI state during my test: rather than showing the full "-P 3"
output from before and after, the more interesting thing is the diff between
the two states:

zsh# diff *during*
1,2c1,2
< [During I/O]
< ============
---
> [During Failure]
> ================
18c18
<               iSCSI Connection State: LOGGED IN
---
>               iSCSI Connection State: TRANSPORT WAIT
20c20
<               Internal iscsid Session State: NO CHANGE
---
>               Internal iscsid Session State: REPOEN
24c24
<               HeaderDigest: CRC32C
---
>               HeaderDigest: None
38c38
<                       Attached scsi disk sdb          State: running
---
>                       Attached scsi disk sdb          State: blocked
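(For watching this live during a test, one can filter just the state lines out of the "-P 3" output. A small sketch; the here-doc below is sample output standing in for the live iscsiadm command:)

```shell
# Pull only the connection/session/disk state lines from
# "iscsiadm -m session -P 3" output; the here-doc is sample output
# standing in for the live command.
grep -E 'Connection State|iscsid Session State|Attached scsi disk' <<'EOF'
    iSCSI Connection State: TRANSPORT WAIT
    Internal iscsid Session State: REPOEN
        Attached scsi disk sdb          State: blocked
EOF
```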


Notice the HeaderDigest ... I could not understand (1) why HeaderDigest
was enabled (CRC32C) during normal I/O, and (2) why it was then not set
during recovery.

So the google-sphere helped me find this patch:

> commit 1bcd3db15d847c712e0566fb0a9e4edb195e198a
> Author: Mike Christie <[email protected]>
> Date:   Sun Apr 20 05:19:49 2008 -0500
> 
>     set header digests to off by default
>     
>     Signed-off-by: Mike Christie <[email protected]>

Note that my iscsitarget software had defined my iSCSI target with:

> Target iqn.2001-04.net.gonzoleeman:test.disk.laptop.001
>       Lun 0 Path=/alt2/big_disc_file.img,Type=fileio
>       HeaderDigest            None,CRC32C
>       DataDigest                      None,CRC32C
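(For completeness: on the initiator side, digest negotiation is driven by the HeaderDigest setting in iscsid.conf or the node record. The lines below show the knob, with CRC32C-preferred given as an example; this is a config fragment, not something I ran verbatim:)

```
# In iscsid.conf (or set via "iscsiadm -m node ... -o update"):
# offer CRC32C first, fall back to None
node.conn[0].iscsi.HeaderDigest = CRC32C,None
# or disable header digests outright:
# node.conn[0].iscsi.HeaderDigest = None
```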



After applying this patch to my SLES 10 SP4 source base, the problem
no longer occurs. Evidently the pre-patch open-iscsi code, together with
this target setting, was negotiating CRC32C for the HeaderDigest ...

Now, when I shut down my iSCSI target during I/O, I get what seems to me
better behavior. Here's the iscsid output from /var/log/messages:

...
Jan 29 09:43:51 sles10vm iscsid: connection5:0 is operational now
=> Tue Jan 29 09:49:23 PST 2013: starting I/O ...
=> Tue Jan 29 09:49:59 PST 2013: killing target ...
Jan 29 09:49:59 sles10vm kernel:  connection5:0: iscsi: detected conn error (1011)
Jan 29 09:50:00 sles10vm iscsid: Kernel reported iSCSI connection 5:0 error (1011) state (3)
Jan 29 09:50:02 sles10vm iscsid: connect failed (111)
Jan 29 09:50:36 sles10vm last message repeated 9 times
Jan 29 09:51:40 sles10vm last message repeated 17 times
Jan 29 09:51:55 sles10vm last message repeated 4 times
Jan 29 09:51:59 sles10vm kernel:  session5: iscsi: session recovery timed out after 120 secs
Jan 29 09:51:59 sles10vm iscsid: connect failed (111)
=> Tue Jan 29 09:52:29 PST 2013: done with test
Jan 29 09:52:29 sles10vm last message repeated 8 times
Jan 29 09:52:55 sles10vm last message repeated 7 times
...

As you can see, I *am* hitting the recovery timeout, which was the original
problem, now fixed.
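(For reference, the 120-second window in that "session recovery timed out" message is the session replacement timeout, set in iscsid.conf; 120 is the usual default:)

```
node.session.timeo.replacement_timeout = 120
```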

The only worry I have now is that even though the target never comes back, I
continue to get what appear to be recovery attempts, as witnessed by the
continued "iscsid: connect failed (111)" messages. Is that normal?

As always, thanks for your help in this issue!
-- 
The Lee-Man
                                Prepare to be unprepared. -- Me
