I'm running RHEL5 kernel 2.6.18 in a VM under VMware ESX 4.0.0, 261974. We have several VMs that don't have any problems, but they are all running some flavor of Windows. This is our only Linux box using a targeted iSCSI volume.

# yum info device-mapper-multipath
Loaded plugins: rhnplugin, security
Installed Packages
Name       : device-mapper-multipath
Arch       : x86_64
Version    : 0.4.7
Release    : 34.el5_5.6
Size       : 6.9 M
Repo       : installed

# yum info iscsi-initiator-utils
Loaded plugins: rhnplugin, security
Installed Packages
Name       : iscsi-initiator-utils
Arch       : x86_64
Version    : 6.2.0.871
Release    : 0.20.el5_5
Size       : 1.9 M
Repo       : installed

We are using multipath connections, and one of the paths frequently gets ping timeouts (usually about every 5 to 15 minutes) followed by a connection error, which causes a problematic pause in I/O while multipathing recovers. The server is our campus web server and is running Apache, mysql, and Joomla.

I've been a sysadmin for other OSes for many years but am new to Linux server administration, and any help would be greatly appreciated. Below are more details. If there is anything else I can provide that would be helpful, please let me know:

# dmesg
device-mapper: multipath: Failing path 8:32.
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4355241960, last ping 4355246960, now 4355251960
connection2:0: detected conn error (1011)
device-mapper: multipath: Failing path 8:32.
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4355898393, last ping 4355903393, now 4355908393
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4357043430, last ping 4357048430, now 4357053430
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4358334347, last ping 4358339347, now 4358344347
connection2:0: detected conn error (1011)
device-mapper: multipath: Failing path 8:32.
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4358911119, last ping 4358916119, now 4358921119
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4360295386, last ping 4360300386, now 4360305386
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4360717303, last ping 4360722303, now 4360727303
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4361937494, last ping 4361942494, now 4361947494
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4363713740, last ping 4363718740, now 4363723740
connection2:0: detected conn error (1011)
device-mapper: multipath: Failing path 8:32.
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4364203932, last ping 4364208932, now 4364213932
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4365150602, last ping 4365155602, now 4365160602
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4365512045, last ping 4365517045, now 4365522045
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4365977352, last ping 4365982352, now 4365987352
connection2:0: detected conn error (1011)
connection2:0: detected conn error (1019)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4366877780, last ping 4366882780, now 4366887780
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4367263655, last ping 4367268655, now 4367273655
connection2:0: detected conn error (1011)

# tail /var/log/messages
Nov 12 10:40:32 techcms kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4368811299, last ping 4368816299, now 4368821299
Nov 12 10:40:32 techcms kernel: connection2:0: detected conn error (1011)
Nov 12 10:40:33 techcms multipathd: sdc: readsector0 checker reports path is down
Nov 12 10:40:33 techcms multipathd: checker failed path 8:32 in map san-techcms
Nov 12 10:40:33 techcms multipathd: san-techcms: remaining active paths: 1
Nov 12 10:40:33 techcms multipathd: dm-2: add map (uevent)
Nov 12 10:40:33 techcms kernel: device-mapper: multipath: Failing path 8:32.
Nov 12 10:40:33 techcms multipathd: dm-2: devmap already registered
Nov 12 10:40:33 techcms iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Nov 12 10:40:55 techcms multipathd: sdc: readsector0 checker reports path is up
Nov 12 10:40:55 techcms multipathd: 8:32: reinstated
Nov 12 10:40:55 techcms multipathd: san-techcms: remaining active paths: 2
Nov 12 10:40:55 techcms multipathd: dm-2: add map (uevent)
Nov 12 10:40:55 techcms multipathd: dm-2: devmap already registered
Nov 12 10:40:55 techcms iscsid: connection2:0 is operational after recovery (2 attempts)
Nov 12 10:43:25 techcms kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4368984192, last ping 4368989192, now 4368994192
Nov 12 10:43:25 techcms kernel: connection2:0: detected conn error (1011)
Nov 12 10:43:26 techcms iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Nov 12 10:43:29 techcms iscsid: connection2:0 is operational after recovery (1 attempts)

# iscsiadm -m session -P3
iSCSI Transport Class version 2.0-871
version 2.0-871
Target: iqn.2001-05.com.equallogic:0-8a0906-d97510304-d8300001c3a4bcf6-techcms
        Current Portal: 192.168.2.55:3260,1
        Persistent Portal: 192.168.2.22:3260,1
                **********
                Interface:
                **********
                Iface Name: eth1
                Iface Transport: tcp
                Iface Initiatorname: iqn.2010-04.edu.tntech:techcms.tntech.edu
                Iface IPaddress: 192.168.3.232
                Iface HWaddress: 00:50:56:81:4C:8E
                Iface Netdev: <empty>
                SID: 1
                iSCSI Connection State: LOGGED IN
                iSCSI Session State: LOGGED_IN
                Internal iscsid Session State: NO CHANGE
                ************************
                Negotiated iSCSI params:
                ************************
                HeaderDigest: None
                DataDigest: None
                MaxRecvDataSegmentLength: 262144
                MaxXmitDataSegmentLength: 65536
                FirstBurstLength: 65536
                MaxBurstLength: 262144
                ImmediateData: Yes
                InitialR2T: No
                MaxOutstandingR2T: 1
                ************************
                Attached SCSI devices:
                ************************
                Host Number: 1  State: running
                scsi1 Channel 00 Id 0 Lun: 0
                        Attached scsi disk sdb          State: running
        Current Portal: 192.168.2.48:3260,1
        Persistent Portal: 192.168.2.22:3260,1
                **********
                Interface:
                **********
                Iface Name: eth2
                Iface Transport: tcp
                Iface Initiatorname: iqn.2010-04.edu.tntech:techcms.tntech.edu
                Iface IPaddress: 192.168.3.233
                Iface HWaddress: 00:50:56:81:02:19
                Iface Netdev: <empty>
                SID: 2
                iSCSI Connection State: LOGGED IN
                iSCSI Session State: LOGGED_IN
                Internal iscsid Session State: NO CHANGE
                ************************
                Negotiated iSCSI params:
                ************************
                HeaderDigest: None
                DataDigest: None
                MaxRecvDataSegmentLength: 262144
                MaxXmitDataSegmentLength: 65536
                FirstBurstLength: 65536
                MaxBurstLength: 262144
                ImmediateData: Yes
                InitialR2T: No
                MaxOutstandingR2T: 1
                ************************
                Attached SCSI devices:
                ************************
                Host Number: 2  State: running
                scsi2 Channel 00 Id 0 Lun: 0
                        Attached scsi disk sdc          State: running

Here's a brief dstat session that might give you an idea of typical data rates but of course they can vary quite a bit.

# dstat -cndmD total,hda,sda -N eth1,eth2
----total-cpu-usage---- --net/eth1- --net/eth2- -dsk/total- --dsk/hda-- --dsk/sda-- ------memory-usage-----
usr sys idl wai hiq siq| recv  send: recv  send| read  writ: read  writ: read  writ| used  buff  cach  free
 17   1  81   1   0   0|    0     0:    0     0| 180k  211k: 5.4B     0:  32k   18k|2769M  242M 9135M 3665M
 24   1  74   0   0   0|9424B  495k:8284B  489k|    0  972k:    0     0:    0     0|2769M  242M 9135M 3664M
 23   3  74   0   0   1| 604B     0: 604B     0|    0     0:    0     0:    0     0|2771M  242M 9135M 3662M
  2   0  98   0   0   0|    0     0:    0     0|    0     0:    0     0:    0     0|2771M  242M 9135M 3662M
  1   0  99   0   0   0| 604B     0: 604B     0|    0  512k:    0     0:    0  256k|2771M  242M 9135M 3662M
  5   0  94   0   0   0| 906B     0: 906B     0|    0     0:    0     0:    0     0|2773M  242M 9135M 3660M
 15   1  84   0   0   0| 114B  180B: 806B  246B|    0     0:    0     0:    0     0|2820M  242M 9135M 3613M
  0   0  99   0   0   0|8162B  359k:7130B  332k|    0  680k:    0     0:    0     0|2820M  242M 9135M 3613M
 11   0  88   0   0   0|    0     0:    0     0|    0     0:    0     0:    0     0|2821M  242M 9135M 3612M
 24   2  75   0   0   0| 626B  180B:    0     0|    0   56k:    0     0:    0   28k|2821M  242M 9135M 3612M
 24   2  74   0   0   0| 906B     0: 906B     0|    0     0:    0     0:    0     0|2822M  242M 9135M 3611M
 13   0  86   0   0   0| 664B     0:  25k  312B|  24k     0:    0     0:    0     0|2823M  242M 9135M 3610M
 24   1  74   0   0   1|  22k  820k:  26k  816k|    0 1596k:    0     0:    0     0|2823M  242M 9135M 3610M
 20   3  76   0   0   1| 302B     0: 302B     0|    0     0:    0     0:    0     0|2871M  242M 9135M 3562M
 10   1  90   0   0   0|    0     0:    0     0|    0   40k:    0     0:    0   20k|2871M  242M 9135M 3562M
 19   2  79   0   0   0| 604B     0: 604B     0|    0     0:    0     0:    0     0|2872M  242M 9135M 3561M
 12   1  87   0   0   0|    0     0:    0     0|    0     0:    0     0:    0     0|2872M  242M 9135M 3561M
  1   0  98   0   0   0|6324B  405k:8370B  415k|    0  808k:    0     0:    0     0|2872M  242M 9135M 3561M
 19   1  79   0   0   0|  92B     0:  92B     0|    0     0:    0     0:    0     0|2874M  242M 9135M 3559M
 10   1  90   0   0   0| 788B     0: 788B     0|    0     0:    0     0:    0     0|2874M  242M 9135M 3558M
  1   0  99   0   0   0|    0     0:    0     0|    0   40k:    0     0:    0   20k|2874M  242M 9135M 3558M
  0   0 100   0   0   0| 604B     0: 604B     0|    0     0:    0     0:    0     0|2874M  242M 9135M 3558M
  0   0  99   0   0   0|5244B  227k:4308B  227k|    0 1024k:    0     0:    0  288k|2874M  242M 9135M 3559M

/etc/iscsi/iscsid.conf -- I modified the FastAbort setting at the end in an effort to resolve these problems but it has not helped.

#
# Open-iSCSI default configuration.
# Could be located at /etc/iscsi/iscsid.conf or ~/.iscsid.conf
#
# Note: To set any of these values for a specific node/session run
# the iscsiadm --mode node --op command for the value. See the README
# and man page for iscsiadm for details on the --op command.
#

################
# iSNS settings
################
# Address of iSNS server
#isns.address = 192.168.0.1
#isns.port = 3205

#############################
# NIC/HBA and driver settings
#############################
# open-iscsi can create a session and bind it to a NIC/HBA.
# To set this up see the example iface config file.

#*****************
# Startup settings
#*****************

# To request that the iscsi initd scripts startup a session set to "automatic".
# node.startup = automatic
#
# To manually startup the session set to "manual". The default is automatic.
node.startup = automatic

# *************
# CHAP Settings
# *************

# To enable CHAP authentication set node.session.auth.authmethod
# to CHAP. The default is None.
#node.session.auth.authmethod = CHAP

# To set a CHAP username and password for initiator
# authentication by the target(s), uncomment the following lines:
#node.session.auth.username = username
#node.session.auth.password = password

# To set a CHAP username and password for target(s)
# authentication by the initiator, uncomment the following lines:
#node.session.auth.username_in = username_in
#node.session.auth.password_in = password_in

# To enable CHAP authentication for a discovery session to the target
# set discovery.sendtargets.auth.authmethod to CHAP. The default is None.
#discovery.sendtargets.auth.authmethod = CHAP

# To set a discovery session CHAP username and password for the initiator
# authentication by the target(s), uncomment the following lines:
#discovery.sendtargets.auth.username = username
#discovery.sendtargets.auth.password = password

# To set a discovery session CHAP username and password for target(s)
# authentication by the initiator, uncomment the following lines:
#discovery.sendtargets.auth.username_in = username_in
#discovery.sendtargets.auth.password_in = password_in

# ********
# Timeouts
# ********
#
# See the iSCSI REAME's Advanced Configuration section for tips
# on setting timeouts when using multipath or doing root over iSCSI.
#
# To specify the length of time to wait for session re-establishment
# before failing SCSI commands back to the application when running
# the Linux SCSI Layer error handler, edit the line.
# The value is in seconds and the default is 120 seconds.
node.session.timeo.replacement_timeout = 120

# To specify the time to wait for login to complete, edit the line.
# The value is in seconds and the default is 15 seconds.
node.conn[0].timeo.login_timeout = 15

# To specify the time to wait for logout to complete, edit the line.
# The value is in seconds and the default is 15 seconds.
node.conn[0].timeo.logout_timeout = 15

# Time interval to wait for on connection before sending a ping.
node.conn[0].timeo.noop_out_interval = 5

# To specify the time to wait for a Nop-out response before failing
# the connection, edit this line. Failing the connection will
# cause IO to be failed back to the SCSI layer. If using dm-multipath
# this will cause the IO to be failed to the multipath layer.
node.conn[0].timeo.noop_out_timeout = 5

# To specify the time to wait for abort response before
# failing the operation and trying a logical unit reset edit the line.
# The value is in seconds and the default is 15 seconds.
node.session.err_timeo.abort_timeout = 15

# To specify the time to wait for a logical unit response
# before failing the operation and trying session re-establishment
# edit the line.
# The value is in seconds and the default is 30 seconds.
node.session.err_timeo.lu_reset_timeout = 20

#******
# Retry
#******

# To specify the number of times iscsid should retry a login
# if the login attempt fails due to the node.conn[0].timeo.login_timeout
# expiring modify the following line. Note that if the login fails
# quickly (before node.conn[0].timeo.login_timeout fires) because the network
# layer or the target returns an error, iscsid may retry the login more than
# node.session.initial_login_retry_max times.
#
# This retry count along with node.conn[0].timeo.login_timeout
# determines the maximum amount of time iscsid will try to
# establish the initial login. node.session.initial_login_retry_max is
# multiplied by the node.conn[0].timeo.login_timeout to determine the
# maximum amount.
#
# The default node.session.initial_login_retry_max is 8 and
# node.conn[0].timeo.login_timeout is 15 so we have:
#
# node.conn[0].timeo.login_timeout * node.session.initial_login_retry_max =
#                                                               120 seconds
#
# Valid values are any integer value. This only
# affects the initial login. Setting it to a high value can slow
# down the iscsi service startup. Setting it to a low value can
# cause a session to not get logged into, if there are distuptions
# during startup or if the network is not ready at that time.
node.session.initial_login_retry_max = 8

################################
# session and device queue depth
################################

# To control how many commands the session will queue set
# node.session.cmds_max to an integer between 2 and 2048 that is also
# a power of 2. The default is 128.
node.session.cmds_max = 128

# To control the device's queue depth set node.session.queue_depth
# to a value between 1 and 1024. The default is 32.
node.session.queue_depth = 32

#***************
# iSCSI settings
#***************

# To enable R2T flow control (i.e., the initiator must wait for an R2T
# command before sending any data), uncomment the following line:
#
#node.session.iscsi.InitialR2T = Yes
#
# To disable R2T flow control (i.e., the initiator has an implied
# initial R2T of "FirstBurstLength" at offset 0), uncomment the following line:
#
# The defaults is No.
node.session.iscsi.InitialR2T = No

#
# To disable immediate data (i.e., the initiator does not send
# unsolicited data with the iSCSI command PDU), uncomment the following line:
#
#node.session.iscsi.ImmediateData = No
#
# To enable immediate data (i.e., the initiator sends unsolicited data
# with the iSCSI command packet), uncomment the following line:
#
# The default is Yes
node.session.iscsi.ImmediateData = Yes

# To specify the maximum number of unsolicited data bytes the initiator
# can send in an iSCSI PDU to a target, edit the following line.
#
# The value is the number of bytes in the range of 512 to (2^24-1) and
# the default is 262144
node.session.iscsi.FirstBurstLength = 262144

# To specify the maximum SCSI payload that the initiator will negotiate
# with the target for, edit the following line.
#
# The value is the number of bytes in the range of 512 to (2^24-1) and
# the defauls it 16776192
node.session.iscsi.MaxBurstLength = 16776192

# To specify the maximum number of data bytes the initiator can receive
# in an iSCSI PDU from a target, edit the following line.
#
# The value is the number of bytes in the range of 512 to (2^24-1) and
# the default is 262144
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144

# To specify the maximum number of data bytes the initiator can receive
# in an iSCSI PDU from a target during a discovery session, edit the
# following line.
#
# The value is the number of bytes in the range of 512 to (2^24-1) and
# the default is 32768
#
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768

# To allow the targets to control the setting of the digest checking,
# with the initiator requesting a preference of enabling the checking, uncomment
# the following lines (Data digests are not supported and on ppc/ppc64
# both header and data digests are not supported.):
#node.conn[0].iscsi.HeaderDigest = CRC32C,None
#
# To allow the targets to control the setting of the digest checking,
# with the initiator requesting a preference of disabling the checking,
# uncomment the following lines:
#node.conn[0].iscsi.HeaderDigest = None,CRC32C
#
# To enable CRC32C digest checking for the header and/or data part of
# iSCSI PDUs, uncomment the following lines:
#node.conn[0].iscsi.HeaderDigest = CRC32C
#
# To disable digest checking for the header and/or data part of
# iSCSI PDUs, uncomment the following lines:
#node.conn[0].iscsi.HeaderDigest = None
#
# The default is to never use DataDigests or HeaderDigests.
#
node.conn[0].iscsi.HeaderDigest = None

#************
# Workarounds
#************

# Some targets like IET prefer after an initiator has sent a task
# management function like an ABORT TASK or LOGICAL UNIT RESET, that
# it does not respond to PDUs like R2Ts. To enable this behavior uncomment
# the following line (The default behavior is Yes):
# node.session.iscsi.FastAbort = Yes

# Some targets like Equalogic prefer that after an initiator has sent
# a task management function like an ABORT TASK or LOGICAL UNIT RESET, that
# it continue to respond to R2Ts. To enable this uncomment this line
node.session.iscsi.FastAbort = No

--------------------

/etc/multipath.conf

# This is a basic configuration file with some examples, for device mapper
# multipath.
# For a complete list of the default configuration values, see
# /usr/share/doc/device-mapper-multipath-0.4.7/multipath.conf.defaults
# For a list of configuration options with descriptions, see
# /usr/share/doc/device-mapper-multipath-0.4.7/multipath.conf.annotated

# Blacklist all devices by default. Remove this to enable multipathing
# on the default devices.
#blacklist {
#        devnode "*"
#}

## By default, devices with vendor = "IBM" and product = "S/390.*" are
## blacklisted. To enable mulitpathing on these devies, uncomment the
## following lines.
#blacklist_exceptions {
#       device {
#               vendor  "IBM"
#               product "S/390.*"
#       }
#}

## Use user friendly names, instead of using WWIDs as names.
defaults {
        user_friendly_names yes
}
##
## Here is an example of how to configure some standard options.
##
#
#defaults {
#       udev_dir                /dev
#       polling_interval        10
#       selector                "round-robin 0"
#       path_grouping_policy    multibus
#       getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
#       prio_callout            /bin/true
#       path_checker            readsector0
#       rr_min_io               100
#       max_fds                 8192
#       rr_weight               priorities
#       failback                immediate
#       no_path_retry           fail
#       user_friendly_names     yes
#}
##
## The wwid line in the following blacklist section is shown as an example
## of how to blacklist devices by wwid. The 2 devnode lines are the
## compiled in default blacklist. If you want to blacklist entire types
## of devices, such as all scsi devices, you should use a devnode line.
## However, if you want to blacklist specific devices, you should use
## a wwid line. Since there is no guarantee that a specific device will
## not change names on reboot (from /dev/sda to /dev/sdb for example)
## devnode lines are not recommended for blacklisting specific devices.
##
#blacklist {
#       wwid 26353900f02796769
#       devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
#       devnode "^hd[a-z]"
        devnode sda
#}
blacklist {
        devnode "^sd[a]$"
}

multipaths {
        multipath {
                wwid                    36090a048301075d9f6bca4c3010030d8
                alias                   san-techcms
                path_grouping_policy    multibus
                path_checker            readsector0
                path_selector           "round-robin 0"
                failback                manual
                rr_weight               priorities
                no_path_retry           5
                rr_min_io               10
        }
#       multipath {
#               wwid                    1DEC_____321816758474
#               alias                   red
#       }
}
#devices {
#       device {
#               vendor                  "COMPAQ "
#               product                 "HSV110 (C)COMPAQ"
#               path_grouping_policy    multibus
#               getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
#               path_checker            readsector0
#               path_selector           "round-robin 0"
#               hardware_handler        "0"
#               failback                15
#               rr_weight               priorities
#               no_path_retry           queue
#       }
#       device {
#               vendor                  "COMPAQ "
#               product                 "MSA1000 "
#               path_grouping_policy    multibus
#       }
#}

Michael W. Wheeler, OpenVMS, Windows, Macintosh Systems Support, Tennessee Technological University
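
P.S. In case it helps anyone looking at the logs above, here is a minimal Python sketch (not something running on the server) that pulls the ping-timeout messages out of dmesg-style text and reports the gap between consecutive timeouts. The function names and the hz=1000 tick rate are assumptions for illustration only. The 5000-tick spacing between "last rx", "last ping", and "now" in the log is consistent with 1000 ticks per second for a 5-second timeout, but that should be confirmed for the kernel in use.

```python
import re

# Pattern for the kernel's iSCSI ping-timeout message, as it appears in the
# dmesg output quoted in this post.
PING_RE = re.compile(
    r"connection(\d+):(\d+): ping timeout of (\d+) secs expired, "
    r"recv timeout (\d+), last rx (\d+), last ping (\d+), now (\d+)"
)

# Two messages copied from the dmesg output above.
SAMPLE = """
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4355241960, last ping 4355246960, now 4355251960
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4355898393, last ping 4355903393, now 4355908393
connection2:0: detected conn error (1011)
"""


def parse_ping_timeouts(text):
    """Return one dict per ping-timeout message found in text."""
    # Collapse whitespace so messages wrapped across lines still match.
    flat = " ".join(text.split())
    events = []
    for match in PING_RE.finditer(flat):
        session, conn, ping_to, recv_to, last_rx, last_ping, now = map(
            int, match.groups()
        )
        events.append(
            {
                "session": session,
                "conn": conn,
                "ping_timeout": ping_to,
                "recv_timeout": recv_to,
                "last_rx": last_rx,
                "last_ping": last_ping,
                "now": now,
            }
        )
    return events


def gaps_seconds(events, hz=1000):
    """Seconds between consecutive timeouts.

    hz is an ASSUMED tick rate (ticks per second), not read from the system.
    """
    return [(b["now"] - a["now"]) / hz for a, b in zip(events, events[1:])]


if __name__ == "__main__":
    found = parse_ping_timeouts(SAMPLE)
    for gap in gaps_seconds(found):
        print("gap between timeouts: %.1f s (%.1f min)" % (gap, gap / 60))
```

For the first two timeouts in the dmesg output, the "now" values differ by 656433 ticks, which under the assumed tick rate is about 656 seconds, or roughly 11 minutes. That matches the 5 to 15 minute spacing described above. Feeding the full dmesg text to parse_ping_timeouts() would show whether the spacing is regular or random.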