Re: device state remains running after taking port down

Mike Christie Mon, 08 Mar 2010 12:17:56 -0800

On 03/08/2010 05:40 AM, Or Gerlitz wrote:

Mike,


I am sure this used to work, and quite sure you made some comments on the
subject. Now this isn't working and I can't find your comments. Anyway:

Using settings similar to those suggested for the nop timeouts in the
multipathing section of the README, taking a port down causes the nop
watchdog to catch the situation:

14:51:33 cto-1 kernel: sd 4:0:0:1: [sdd] Attached SCSI disk
14:51:33 cto-1 kernel: sd 5:0:0:1: [sde] Attached SCSI disk
14:51:34 cto-1 iscsid: connection2:0 is operational now
14:51:34 cto-1 iscsid: connection3:0 is operational now
14:52:19 cto-1 kernel:  connection3:0: ping timeout of 10 secs expired, recv timeout 5 ...
14:52:19 cto-1 kernel:  connection3:0: detected conn error (1011)
14:52:20 cto-1 iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)
14:52:35 cto-1 kernel:  session3: session recovery timed out after 15 secs



and changes the session state such that $ iscsiadm -m session/host -P 3 yields,
for the failed session/host:

        iSCSI Connection State: TRANSPORT WAIT
        iSCSI Session State: FREE
        Internal iscsid Session State: REOPEN


where the session associated with the other port of the initiator, which is
UP, gives:

        iSCSI Connection State: LOGGED IN
        iSCSI Session State: LOGGED_IN
        Internal iscsid Session State: NO CHANGE


BUT, for some reason the SCSI host state is still "running" and not "blocked"
even after 30 minutes. I remember this wasn't the case in the not-so-distant
past... Am I doing something wrong? Don't tell me it's RTFM...

I got this behaviour on both RHEL 5.4 and 2.6.33 with iscsi/tcp and iser. I
just ran this without multipathing. When multipathing was used (RHEL 5.4), the
multipath driver was aware that the device was down as of the path checker
timeout. See more details on my config in the attachment.

The host state is never going to change for this. The scsi host and session
states are completely different. It is just due to the weirdness of iscsi_tcp
and ib_iser that we do a host per session. Normally drivers do a host per pci
resource, so if one session goes down the entire host state is not going to
change just because that one session is down: the host can recover the
session while leaving the host in a normal state.
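
To see the two states side by side, something like the following rough Python
sketch can read them from sysfs. It assumes the usual
/sys/class/scsi_host/hostN/state and /sys/class/scsi_device/H:C:T:L/device/state
files are present. The function names are made up for illustration, and host 5
with 5:0:0:1 (sde in the log) is only an example, since the log does not say
which host belongs to session3.

```python
from pathlib import Path


def read_state(path):
    """Return the stripped contents of a sysfs state file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def scsi_states(host_no, hctl, sysfs_root="/sys"):
    """Read the SCSI host state and the SCSI device state for one device.

    host_no    -- SCSI host number, e.g. 5 (example value, an assumption)
    hctl       -- host:channel:target:lun string, e.g. "5:0:0:1"
    sysfs_root -- overridable so the function can be tried on a fake tree
    """
    root = Path(sysfs_root)
    return {
        "host": read_state(root / "class" / "scsi_host" / f"host{host_no}" / "state"),
        "device": read_state(root / "class" / "scsi_device" / hctl / "device" / "state"),
    }


if __name__ == "__main__":
    # Example values only; substitute the host and H:C:T:L of the failed session.
    print(scsi_states(5, "5:0:0:1"))
```

The host entry is the one that is not expected to change when a single session
goes down; the device entry is the one the iscsi layer blocks and unblocks.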

That said, the scsi_error.c eh is overly careful and will bring down the
entire host when a cmd times out, and that is when you would see the host
state go into recovery. As we have discussed before, the scsi eh has to
support lots of simple drivers that cannot handle the eh running while normal
IO is running, so it will stop the entire host. This is why, for session
level recovery, we try to prevent the scsi eh from running.

When the session initially goes down and is cleaned up (when iscsi_conn_stop
is called), the iscsi layer blocks the session. That puts the devices in the
blocked state. Then, when the recovery/replacement timeout fires and you see

session3: session recovery timed out after 15 secs

the iscsi layer unblocks the devices (puts them in running state) and fails
all IO to the devices.
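
The sequence above can be written down as a toy timeline in Python. This is
only a model of the description in this mail, not the kernel code. The
function name and its parameters are hypothetical, and it deliberately ignores
the case where the session is recovered before the timeout.

```python
def device_state(t, down_at, recovery_timeout):
    """Toy model of the device state at time t (seconds).

    down_at          -- time the session was stopped and blocked
    recovery_timeout -- recovery/replacement timeout in seconds (15 in the log)

    Returns (state, io_fails). Assumes the session never comes back.
    """
    if t < down_at:
        # Session is up: devices run and IO goes through.
        return ("running", False)
    if t < down_at + recovery_timeout:
        # Session is blocked: devices are blocked and IO is held, not failed.
        return ("blocked", False)
    # Timeout fired: devices are unblocked (running again) but all IO is failed.
    return ("running", True)


if __name__ == "__main__":
    for t in (-1, 5, 20):
        print(t, device_state(t, down_at=0, recovery_timeout=15))
```

So a device showing "running" long after the port went down matches the last
branch: the state is running, but IO sent to it is failed.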

On linux-scsi we have discussed that we need a new state for this, something
like the offline state. I have to reread the thread on why people do not like
offline for that state (I think because offline means offlined due to the
scsi eh failing).


