On 01/29/2013 12:30 PM, Lee Duncan wrote:
> Hi Mike:
> 
> I think I may have figured out my problem, at least partly  ... answers below.
> 
> On Jan 23, 2013, at 1:58 PM, Mike Christie <[email protected]> wrote:
> 
>> On 01/17/2013 07:18 PM, Lee Duncan wrote:
>>>>> Yeah, that should trigger it. Are you seeing IO failed with DID_BUS_BUSY
>>>>> too, like in the Novell bugzilla?
>>> I never saw "DID_BUS_BUSY" in the logs attached to the original bug report,
>>> and I don't see any such message in /var/log/messages now. I know my kernel
>>> has the "DID_BUS_BUSY" code present, but I just don't know how to tell if
>>> anything is returning that or not.
>>>
>>
>> In older SLES kernels we just got the hex value of the errors variable. So:
>>
>> Oct  3 19:34:10 IBMx3250-200-174 kernel: sd 1:0:0:0: SCSI error: return
>> code = 0x00020000
>>
>> is DID_BUS_BUSY right?
> 
> Yes, of course you are correct. :) I was looking for the string itself. (Duh!)
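(For anyone reading along: a quick sketch of how that return code decodes. The host byte sits in bits 16-23 of the kernel's 32-bit SCSI result word, so 0x00020000 carries host status 0x02, which is DID_BUS_BUSY. The helper below is my own illustration, not open-iscsi code.)

```python
# Illustrative helper (not from open-iscsi): decode the host byte of a
# Linux SCSI result word as printed in "SCSI error: return code = 0x...".
# The host byte occupies bits 16-23; 0x02 there is DID_BUS_BUSY.
HOST_STATUS = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
}

def host_byte(result):
    """Extract the host byte from a 32-bit SCSI result word."""
    return (result >> 16) & 0xFF

print(HOST_STATUS.get(host_byte(0x00020000), "unknown"))  # DID_BUS_BUSY
```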
> 
>>
>>
>> In the log you sent I do not see any errors except:
>>
>> Jan 16 13:02:49 sles10vm kernel: sd 2:0:0:0: timing out command, waited 60s
>>
>> This sort of makes sense because it looks like the failure has lasted a
>> minute (start of error below):
>>
>> Jan 16 13:01:49 sles10vm kernel:  connection2:0: iscsi: detected conn
>> error (1011)
>>
>> What does not make sense is why that command is floating around getting
>> retried and hitting that time check. In the upstream code the IO is in a
>> blocked queue, so it should not hit that check while blocked. While the
>> replacement/recovery timeout has not expired, the command should just be
>> sitting in the queue and never reach that "timing out command" code path.
>>
>> When you run the test, before IO is failed with that error, is the iscsi
>> device in the state "blocked"? You can run iscsiadm -m session -P 3 to
>> see the states.
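(A small parsing sketch, in case it helps automate that check in a test harness. The regex assumes the "Attached scsi disk ... State: ..." line format that `iscsiadm -m session -P 3` prints; the helper itself is mine, not part of the tools.)

```python
import re

# Illustrative helper (not part of open-iscsi): pull the per-disk states
# out of captured "iscsiadm -m session -P 3" output, e.g. to poll for
# "blocked" during a failure test.
DISK_STATE_RE = re.compile(r"Attached scsi disk (\S+)\s+State: (\w+)")

def disk_states(p3_output):
    """Return {disk_name: state} parsed from -P 3 text."""
    return dict(DISK_STATE_RE.findall(p3_output))

sample = "\t\t\tAttached scsi disk sdb\t\tState: blocked\n"
print(disk_states(sample))  # {'sdb': 'blocked'}
```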
> 
> Thank you for this wonderful clue! I should have been looking at the state
> of the session during this error myself.
> 
> [Note that I also changed my I/O for this test from sg_io to regular
> (non-buffered) reads, as you suggested in another email to me.]
> 
> As to the iSCSI state during my test, rather than showing the full "-P 3"
> output from before and after, the more interesting thing was the diff
> between the two states:
> 
> zsh# diff *during*
> 1,2c1,2
> < [During I/O]
> < ============
> ---
>> [During Failure]
>> ================
> 18c18
> <               iSCSI Connection State: LOGGED IN
> ---
>>               iSCSI Connection State: TRANSPORT WAIT
> 20c20
> <               Internal iscsid Session State: NO CHANGE
> ---
>>               Internal iscsid Session State: REOPEN
> 24c24
> <               HeaderDigest: CRC32C
> ---
>>               HeaderDigest: None
> 38c38
> <                       Attached scsi disk sdb          State: running
> ---
>>                       Attached scsi disk sdb          State: blocked
> 
> 
> Notice the HeaderDigest ... I could not understand (1) why HeaderDigest
> was enabled during normal I/O, and (2) why it was then not set during
> recovery.
> 
> So the google-sphere helped me find this patch:
> 
>> commit 1bcd3db15d847c712e0566fb0a9e4edb195e198a
>> Author: Mike Christie <[email protected]>
>> Date:   Sun Apr 20 05:19:49 2008 -0500
>>
>>     set header digests to off by default
>>     
>>     Signed-off-by: Mike Christie <[email protected]>
> 
> Note that my iscsitarget software had defined my iSCSI target with:
> 
>> Target iqn.2001-04.net.gonzoleeman:test.disk.laptop.001
>>      Lun 0 Path=/alt2/big_disc_file.img,Type=fileio
>>      HeaderDigest            None,CRC32C
>>      DataDigest                      None,CRC32C
> 
> 
> 
> After applying this patch to my SLES 10 SP4 source base, the problem
> no longer occurs. Before the patch, the open-iscsi code was negotiating
> CRC32C for the HeaderDigest ...
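(For anyone hitting the same mismatch on an unpatched tree: the digest can also be pinned from the initiator side, either in iscsid.conf or per node record with iscsiadm. A sketch, assuming a stock open-iscsi config layout:)

```
# /etc/iscsi/iscsid.conf (applies to newly discovered nodes):
node.conn[0].iscsi.HeaderDigest = None

# or update an existing node record, e.g.:
# iscsiadm -m node -T iqn.2001-04.net.gonzoleeman:test.disk.laptop.001 \
#          -o update -n node.conn[0].iscsi.HeaderDigest -v None
```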
> 
> Now, when I shut down my iSCSI target during I/O, I get what seems to me
> better behavior. Here's the iscsid output from /var/log/messages:
> 
> ...
> Jan 29 09:43:51 sles10vm iscsid: connection5:0 is operational now
> => Tue Jan 29 09:49:23 PST 2013: starting I/O ...
> => Tue Jan 29 09:49:59 PST 2013: killing target ...
> Jan 29 09:49:59 sles10vm kernel:  connection5:0: iscsi: detected conn error (1011)
> Jan 29 09:50:00 sles10vm iscsid: Kernel reported iSCSI connection 5:0 error (1011) state (3)
> Jan 29 09:50:02 sles10vm iscsid: connect failed (111)
> Jan 29 09:50:36 sles10vm last message repeated 9 times
> Jan 29 09:51:40 sles10vm last message repeated 17 times
> Jan 29 09:51:55 sles10vm last message repeated 4 times
> Jan 29 09:51:59 sles10vm kernel:  session5: iscsi: session recovery timed out after 120 secs
> Jan 29 09:51:59 sles10vm iscsid: connect failed (111)
> => Tue Jan 29 09:52:29 PST 2013: done with test
> Jan 29 09:52:29 sles10vm last message repeated 8 times
> Jan 29 09:52:55 sles10vm last message repeated 7 times
> ...
> 
> As you can see, I *am* hitting the recovery timeout. That was the
> original problem, and it is now fixed.
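(The 120-second window in that log is the session replacement timeout; if a test needs a shorter or longer window, it is tunable in iscsid.conf. Sketch, assuming the stock parameter name:)

```
# /etc/iscsi/iscsid.conf: how long to wait for session re-establishment
# before failing IO back up (default 120 seconds):
node.session.timeo.replacement_timeout = 120
```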
> 
> The only worry I have now is that even though the target never comes
> back, I continue to get what appear to be recovery attempts, as witnessed
> by the continued "iscsid: connect failed (111)" messages. Is that normal?
> 

Sorry. Missed that question at the bottom. Yes, it is normal. The
initiator will try to connect until you run iscsiadm to tell it to stop.
We used to stop if we got certain errors from the target, but we took
that out because different targets were doing different things and I
couldn't keep it straight. So now we just keep trying to reconnect, so
that if the user needs access to the data when the target comes back, we
can provide it.
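If you do want the retries to stop (say, the target is gone for good), logging the session out does it. For example, using the target name from earlier in this thread:

```
iscsiadm -m node -T iqn.2001-04.net.gonzoleeman:test.disk.laptop.001 --logout
# or to drop all sessions:
# iscsiadm -m node --logoutall=all
```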

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.

