RE: Target reboot -> iscsiadm rescan Stuck

Cale, Yonatan Tue, 29 Apr 2014 07:05:08 -0700

Hi,
I understand this problem is probably unrelated to your product, so you're not 
supposed to start debugging it, that makes sense (and also makes me sad :P )
So first of all thank you (and fusion-io) for sitting with me on this so far, 
it's not something trivial for me to expect.
I'd like to understand your iscsi-related analysis so far, so that I can 
continue the investigation by myself, if that's ok.
So:
You assume that an unanswered "report LUNs" sent from the scsi layer 
caused the scsi layer to send an inquiry to each of the ~infinite possible LUNs 
separately.
-Can you tell me which call made by "iscsiadm" is getting stuck? For example, 
is it issuing a "report LUNs" that the scsi layer is refusing to answer? 
-In the "messages_end" log I sent you, it has ~infinite "session1: 
iscsi_queuecommand iscsi: cmd 0x12 is not queued" prints. What is this? Is this 
related to the scsi layer querying all LUNs as you suggested?  (If not then why 
do you think the problem is in the scsi layer?)

In other words, I'm asking what's going on right over the scsi layer (with 
iscsiadm) and just below the scsi layer (iscsi layer), both are components that 
you probably know well - so that I can be on-target with my continued 
investigation.

So thanks again, and wish me luck
:)
Yonatan

-----Original Message-----
From: Mike Christie [mailto:[email protected]] 
Sent: Monday, April 28, 2014 4:11 AM
To: Cale, Yonatan
Cc: [email protected]; [email protected]
Subject: Re: Target reboot -> iscsiadm rescan Stuck

On 4/27/14, 10:38 AM, Cale, Yonatan wrote:
> Hi,
> Our sim module is above the scsi layer (not between the iscsi&scsi layers), 
> so I think this already rules out this guess.
>
> What we do is something like this:
> -Send scsi command
> -If we didn't get a response after X seconds, --Abort the command 
> (perhaps many times, if the abort fails)
>
> So.. We add some prints somewhere new?
>

I should have written *you* have to add some printks in the scsi/block layer :) 
As iscsi maintainer I am happy to help all vendors on iscsi related issues, as 
you have seen in this thread, but I work for Fusion-io on their FC/SRP/iSCSI 
target, ION, so I do not have time to debug all kernel layers for a 
multi-billion dollar company like EMC :)

If I hit this problem with our product, I would look over the scsi scan code, 
since we see those commands time out, and see how it handles timeout failures 
for report luns and inquiries.

Probably what the problem is, is that the scsi layer tried to send a report luns, 
which timed out due to your target not responding for whatever reason. The scsi 
layer handled that by thinking that it failed because the target does not support 
report luns, and not due to it just timing out, and the scsi layer dropped down to 
a sequential scan as a result. So all those inquiries in the logs are not retries, 
but instead the scsi layer trying to see if a lu is behind lun0, lun1, 
lun2....... lun(N = MAX_UNSIGNED_INT).
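
As a rough illustration of that guess, here is a toy model in C. It is not the kernel's scsi scan code: the enum, the struct and every function name below are made up for the example. It only contrasts a scan that treats any REPORT LUNS failure as "not supported" with one that treats a timeout as a reason to stop.

```c
#include <stdint.h>

/* Hypothetical outcomes of one probe command; not kernel definitions. */
enum probe_result { PROBE_OK, PROBE_TIMEOUT, PROBE_UNSUPPORTED };

/* Hypothetical target model, used only for this illustration. */
struct fake_target {
    enum probe_result report_luns_result; /* how REPORT LUNS ends */
    enum probe_result inquiry_result;     /* how every INQUIRY ends */
    uint32_t inquiries_sent;              /* counts INQUIRY commands issued */
};

static enum probe_result send_report_luns(struct fake_target *t)
{
    return t->report_luns_result;
}

static enum probe_result send_inquiry(struct fake_target *t, uint32_t lun)
{
    (void)lun; /* the toy target answers every LUN the same way */
    t->inquiries_sent++;
    return t->inquiry_result;
}

/*
 * Naive scan: any REPORT LUNS failure, including a timeout, is read as
 * "command not supported", so every LUN below max_lun gets its own INQUIRY.
 * Returns the number of INQUIRY commands sent.
 */
uint32_t scan_naive(struct fake_target *t, uint32_t max_lun)
{
    if (send_report_luns(t) == PROBE_OK)
        return t->inquiries_sent;
    for (uint32_t lun = 0; lun < max_lun; lun++)
        send_inquiry(t, lun); /* keeps going even if each one times out */
    return t->inquiries_sent;
}

/*
 * Stricter scan: a timeout means the target is not answering, so the scan
 * stops instead of walking the whole LUN range.
 */
uint32_t scan_strict(struct fake_target *t, uint32_t max_lun)
{
    enum probe_result r = send_report_luns(t);

    if (r == PROBE_OK || r == PROBE_TIMEOUT)
        return t->inquiries_sent;
    for (uint32_t lun = 0; lun < max_lun; lun++) {
        if (send_inquiry(t, lun) == PROBE_TIMEOUT)
            break; /* target went silent: give up on the scan */
    }
    return t->inquiries_sent;
}
```

With a target that times out on everything, scan_naive(&t, 1000) issues 1000 inquiries while scan_strict(&t, 1000) issues none, which is the difference the logs would show.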

If that is not the problem, I would add debug code to the 
scsi_request_fn/scsi_dispatch_cmd and 
scsi_done/scsi_softirq_done/scsi_decide_disposition/scsi_finish_command/scsi_io_completion
to see why those inquiries are retried when they should be failed.

> I'd like to say again, that this bug happens with one version of VNX but not 
> with another version. Do you think that might give us a hint?
>

Yes. I would guess your other VNX versions reply to the scsi scan related IO, 
so we do not fall into this problem where the scsi scan IO timed out, and IO is 
now endlessly retried or we drop down to a sequential scan. Again, if I worked for 
EMC, I would have compared the logs for different versions to see what behavior 
changed.

Hope this helps. If you have even the slightest hunch it is an iscsi code 
problem, come back and bug me, because I really do not care what vendor you are 
from when fixing iscsi bugs.

> -----Original Message-----
> From: Mike Christie [mailto:[email protected]]
> Sent: Thursday, April 24, 2014 10:04 PM
> To: Cale, Yonatan
> Cc: [email protected]; [email protected]
> Subject: Re: Target reboot -> iscsiadm rescan Stuck
>
> On 04/22/2014 04:10 AM, Cale, Yonatan wrote:
>> -----Original Message-----
>> From: Mike Christie [mailto:[email protected]]
>> Sent: Tuesday, April 22, 2014 12:38 AM
>> To: Cale, Yonatan
>> Cc: [email protected]; [email protected]
>> Subject: Re: Target reboot -> iscsiadm rescan Stuck
>>
>>> Do you have some module that is hooking into the scsi layer or iscsi 
>>> modules? Just wondering what the "sim_try_to_abort_cmd" call is. Where are 
>>> you hooking in?
>> "sim" is our module that handles iscsi data-path. We hook for 
>> notifications in order to know if we should cancel a command
>
>
> Hey, does your sim module that handles the data path just monitor, or 
> do you handle error codes that the iscsi modules return? The problem 
> is that the iscsi layer is trying to fail a scsi scan related command, 
> but whatever layer is above it (I thought it was just the scsi layer 
> like normal in my other response) just keeps retrying it. Does your 
> module do anything to IO failed with
>
> #define DID_TRANSPORT_FAILFAST  0x0f /* Transport class fastfailed the 
> io */
>
> from the queuecommand path? Is it the one forcing the retry? That would 
> explain why we do not see anything from the scsi scan layer debug printks.
>
> If not, then it is the scsi or block layer and we will have to add some 
> printks in there.
>
>
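
For reference, the kind of check asked about in the quoted message could look roughly like the sketch below. It assumes the usual layout of the SCSI result word, with the host byte in bits 16..23. The helper name upper_layer_should_retry and its retries_left parameter are hypothetical and not part of any kernel or iscsi API. Only the DID_TRANSPORT_FAILFAST value comes from the thread.

```c
#include <stdbool.h>

#define DID_OK                 0x00 /* no host-level error */
#define DID_TRANSPORT_FAILFAST 0x0f /* Transport class fastfailed the io */

/* The host byte sits in bits 16..23 of the SCSI result word. */
#define host_byte(result) (((result) >> 16) & 0xff)

/*
 * Hypothetical retry decision for a layer sitting above the iscsi modules.
 * This sketch looks only at the host byte and ignores status and sense data.
 */
bool upper_layer_should_retry(int result, int retries_left)
{
    if (host_byte(result) == DID_OK)
        return false; /* nothing failed at the host level */
    if (host_byte(result) == DID_TRANSPORT_FAILFAST)
        return false; /* transport asked for a fast fail: do not retry */
    return retries_left > 0;
}
```

A layer that skipped the second check and retried on every non-zero host byte would keep resubmitting commands that the iscsi layer is trying to fail.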

