Akshay Lal wrote:
> Devs,
> 
>     We seem to be having an issue with the time to failover over iSCSI.
> The end goal here being to force a failover within 10 seconds to an
> alternate path as defined by dm-multipath.
> 
> Distro: CentOS
> Kernel version: 2.6.29.5
> dm-multipath version: device-mapper-multipath-0.4.7-17.el5
> iscsid version: iscsi-initiator-utils-6.2.0.868-0.7.el5
> 
> We have dm-multipath installed and configured with the following
> configurations:
>                 udev_dir                     /dev
>                 polling_interval            3
>                 selector                      "round-robin 0"
>                 path_grouping_policy   failover
>                 getuid_callout             "/sbin/scsi_id -g -u -s /block/%n"
>                 prio_callout                 /bin/true
>                 path_checker              tur
>                 rr_min_io                    10
>                 max_fds                     8192
>                 rr_weight                    uniform
>                 failback                      manual
>                 no_path_retry             fail
>                 user_friendly_names   yes
> 
> We have also modified scsi PDU timeout:
>                 ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
>                                 RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"
> in the /etc/udev/rules.d/50-udev.rules.
> 
> We have also modified some parameters in /etc/iscsi/iscsid.conf:
>                 node.session.timeo.replacement_timeout = 5
>                 node.conn[0].timeo.login_timeout = 5
>                 node.conn[0].timeo.logout_timeout = 5
>                 node.conn[0].timeo.noop_out_interval = 5
>                 node.conn[0].timeo.noop_out_timeout = 10
>


When you run

iscsiadm -m node -T target_name | grep node.session.timeo.replacement_timeout

do you see 5 seconds?

When you do cat /sys/class/iscsi_session/sessionX/recovery_tmo do you 
see 5 seconds?
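
The sysfs check above can be run across every session at once. The following is a minimal sketch, assuming each session directory under /sys/class/iscsi_session exposes a recovery_tmo file as described above; the function name read_recovery_timeouts and its sysfs_root parameter are made up for illustration.

```python
from pathlib import Path


def read_recovery_timeouts(sysfs_root="/sys/class/iscsi_session"):
    """Return {session_name: recovery_tmo_seconds} for each iSCSI session.

    sysfs_root is a parameter (an assumption for illustration) so the
    function can be pointed at a test directory instead of the real sysfs.
    """
    timeouts = {}
    for session_dir in sorted(Path(sysfs_root).glob("session*")):
        tmo_file = session_dir / "recovery_tmo"
        # Skip anything that does not expose the attribute.
        if tmo_file.is_file():
            timeouts[session_dir.name] = int(tmo_file.read_text().strip())
    return timeouts


if __name__ == "__main__":
    for name, tmo in read_recovery_timeouts().items():
        print(f"{name}: recovery_tmo={tmo}s")
```

If the value printed here differs from node.session.timeo.replacement_timeout in the node record, the running session is probably still using the old setting.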


> Given the above configuration the failover takes place in 2 minutes.
> 
> 
> After reading a few posts on this group, I did try to change the scsi
> PDU timeout & the node.session.timeo.replacement_timeout, but still
> that didn't change the failover time.
> 
> However, if I modify the scsi PDU timeout to 3, the node.conn


What do you mean by the scsi pdu timeout? Do you mean the scsi command 
timer?


> [0].timeo.noop_out_interval = 1 & node.conn[0].timeo.noop_out_timeout
> = 2, we seem to get a failover in about 65 seconds (that's still too
> long for our purposes)
> 
> /etc/iscsi/iscsid.conf:
>                 node.session.timeo.replacement_timeout = 120
>                 node.conn[0].timeo.login_timeout = 15
>                 node.conn[0].timeo.logout_timeout = 15
>                 node.conn[0].timeo.noop_out_interval = 1
>                 node.conn[0].timeo.noop_out_timeout = 2
> 
> /etc/udev/rules.d/50-udev.rules:
>                 ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
>                                 RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"
> 
> Not sure why these values actually cause a difference in the failover
> time, but apparently changing any other parameter doesn't really help.
> 
> /var/log/messages:
> When the cable (power) is pulled from the primary:
>      Jul  8 15:38:23 cschi-mbxdsg-0226.cleversafelabs.com kernel:
> connection2:0: ping timeout of 2 secs expired, last rx 4295169719,
> last ping 4295170719, now 4295172719
>      Jul  8 15:38:23 cschi-mbxdsg-0226.cleversafelabs.com kernel:
> connection2:0: detected conn error (1011)
>      Jul  8 15:38:24 cschi-mbxdsg-0226.cleversafelabs.com iscsid:
> Kernel reported iSCSI connection 2:0 error (1011) state (3)
> .

During this time if you run

iscsiadm -m session -P 3

do you see that the device state for sdb is blocked? Could you send the 
output of that command?
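
As a cross-check alongside the iscsiadm output (not a replacement for it), the device state and the scsi command timer can also be read straight from sysfs. This is a rough sketch, assuming the disk is sdb as in the logs and that /sys/block/sdb/device/state and /sys/block/sdb/device/timeout are present; the function name read_scsi_device_info and the sysfs_block parameter are hypothetical.

```python
from pathlib import Path


def read_scsi_device_info(device, sysfs_block="/sys/block"):
    """Return the SCSI device state and command timer for a disk like 'sdb'.

    sysfs_block is a parameter (an assumption for illustration) so the
    function can be exercised against a fake directory tree.
    """
    dev_dir = Path(sysfs_block) / device / "device"
    # 'state' holds a word such as running or blocked.
    state = (dev_dir / "state").read_text().strip()
    # 'timeout' holds the scsi command timer in seconds.
    timeout = int((dev_dir / "timeout").read_text().strip())
    return {"state": state, "timeout": timeout}


if __name__ == "__main__":
    info = read_scsi_device_info("sdb")
    print(f"sdb: state={info['state']} timeout={info['timeout']}s")
```

Running this in a loop while the cable is pulled would show whether the device sits in the blocked state for the whole interval before the path is failed.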



> .
> No messages for a bit and then: (the failover occurs at this point)
>      Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd
> 2:0:0:0: timing out command, waited 18s
>      Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd
> 2:0:0:0: [sdb] Unhandled error code
>      Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd
> 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED
> driverbyte=DRIVER_OK,SUGGEST_OK
>      Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel:
> end_request: I/O error, dev sdb, sector 1599400
>      Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel:
> device-mapper: multipath: Failing path 8:16.
>      Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd
> 2:0:0:0: timing out command, waited 18s
>      Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd
> 2:0:0:0: [sdb] Unhandled error code
>      Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd
> 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED
> driverbyte=DRIVER_OK,SUGGEST_OK
>      Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel:
> end_request: I/O error, dev sdb, sector 1600424
> 

Do you see a

session recovery timed out after %d secs

message anywhere in /var/log/messages? (%d would be replaced with the 
node.session.timeo.replacement_timeout value.)

When you have the default scsi command timeout of 60 secs and the 
failover takes 2 minutes, do you see that message followed by the 
"end_request: I/O error" messages?
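
To answer the log questions above without reading /var/log/messages by eye, something like the following could be used. It is only a sketch: it assumes each syslog entry is on a single line (not wrapped as in the quoted mail), matches on the message text shown in this thread, and uses a placeholder year because syslog timestamps omit one. The names parse_stamp and summarize_failover are made up for illustration.

```python
import re
from datetime import datetime

STAMP_RE = re.compile(r"^\s*(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})")
RECOVERY_RE = re.compile(r"session recovery timed out after (\d+) secs")


def parse_stamp(line):
    """Parse a leading syslog timestamp such as 'Jul  8 15:38:23'."""
    match = STAMP_RE.match(line)
    if not match:
        return None
    stamp = " ".join(match.group(1).split())
    # Syslog omits the year; 2000 is an arbitrary placeholder.
    return datetime.strptime(f"2000 {stamp}", "%Y %b %d %H:%M:%S")


def summarize_failover(lines):
    """Report the conn-error-to-path-failure delay and recovery timeouts seen."""
    conn_error = None
    path_failed = None
    recovery_timeouts = []
    for line in lines:
        when = parse_stamp(line)
        if when is None:
            continue
        if conn_error is None and "detected conn error" in line:
            conn_error = when
        elif (conn_error is not None and path_failed is None
              and "multipath: Failing path" in line):
            path_failed = when
        match = RECOVERY_RE.search(line)
        if match:
            recovery_timeouts.append(int(match.group(1)))
    delay = None
    if conn_error is not None and path_failed is not None:
        delay = int((path_failed - conn_error).total_seconds())
    return {"failover_delay_secs": delay, "recovery_timeouts": recovery_timeouts}


if __name__ == "__main__":
    with open("/var/log/messages") as log:
        print(summarize_failover(log))
```

An empty recovery_timeouts list alongside a long delay would suggest the session recovery timer never fired before the path was failed.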

> 
> Any clues on how we can reduce this failover time would be
> appreciated.
> 
> 
> --
> 
> Akshay Lal
> 
> 
> > 

