Mike,

Just as an FYI (in case you were curious about this issue), I've narrowed it down to something with CHAP. If I disable CHAP on my EqualLogic, I can't reproduce the problem.

After upgrading to the latest OEL 5.3 release of the iscsi-initiator, I could still reproduce the problem. So I did the following:

1) Set up another test environment using the same hardware (physically different hardware, but all the same firmware, models, etc.)
2) Presented a new test volume from the EqualLogic
3) Ran the ping test (ping -Ieth2 192.168.0.19 & ping -Ieth3 192.168.0.19)
4) I couldn't reproduce the issue.
5) I checked what the differences were; CHAP was the only difference.
6) So I turned on CHAP authentication for the volume.
7) rm -rf /var/lib/iscsi/nodes/* /var/lib/iscsi/send_targets/*
8) Rediscovered targets (after modifying /etc/iscsi/iscsid.conf with the CHAP information):

   node.session.auth.authmethod = CHAP
   node.session.auth.username = mychapuserhere
   node.session.auth.password = mypasshere

9) Ran the same ping test and was able to get iSCSI sessions to fail within 2 minutes.
10) I wanted to prove that CHAP was the issue, so I logged out of all iSCSI sessions.
11) Disabled CHAP on the EqualLogic
12) Rediscovered targets and logged back in to the sessions (without CHAP authentication)
13) Ran the ping tests and couldn't break it after 30 minutes.
14) Added CHAP again and was able to break the sessions within 2 minutes.

So there is definitely something odd with CHAP (my guess: either in the open-iscsi code or the EqualLogic code). I've asked Roger Lopez, from Dell, to attempt to reproduce this in his lab. He has EqualLogic and Oracle VM Servers. The Oracle developers that I'm working with don't currently have access to an EqualLogic, but they are attempting to reproduce this with their iSCSI equipment as well. I'm going to set up port mirroring on our switch and run tcpdumps to see what I can get.

Thanks for all your help so far. Everyone that I've been working with on this issue has been a great help!

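For anyone who wants to repeat the ping test from steps 3, 9 and 13 and time how long the sessions survive, here is a minimal Python sketch. It is only an illustration, not something that was run as part of the testing above. The function names (`ping_argv`, `first_conn_error`, `run_ping_test`) are hypothetical, the log path and the "detected conn error" text are taken from the /var/log/messages excerpt quoted further down in this thread, and the 1800 second limit simply mirrors the 30 minute run.

```python
import re
import subprocess
import time

# Text of the kernel message quoted later in this thread.
CONN_ERROR = re.compile(r"detected conn error \((\d+)\)")


def ping_argv(iface, target):
    """Build the argv for one interface-bound ping (like `ping -I eth2 <ip>`)."""
    return ["ping", "-I", iface, target]


def first_conn_error(lines):
    """Return the error code of the first 'detected conn error' line, or None."""
    for line in lines:
        match = CONN_ERROR.search(line)
        if match:
            return int(match.group(1))
    return None


def run_ping_test(ifaces, target, log_path="/var/log/messages", limit_secs=1800):
    """Start one ping per interface and watch the log for up to limit_secs.

    Returns the seconds until the first connection error, or None if none
    was logged within the limit. Assumes the log file is readable.
    """
    pings = [
        subprocess.Popen(ping_argv(iface, target), stdout=subprocess.DEVNULL)
        for iface in ifaces
    ]
    start = time.time()
    try:
        with open(log_path) as log:
            log.seek(0, 2)  # skip to the end: only look at new lines
            while time.time() - start < limit_secs:
                line = log.readline()
                if not line:
                    time.sleep(1)
                    continue
                if first_conn_error([line]) is not None:
                    return time.time() - start
        return None
    finally:
        for ping in pings:
            ping.terminate()


if __name__ == "__main__":
    # Interfaces and group IP as used in this thread.
    print(run_ping_test(["eth2", "eth3"], "192.168.0.19"))
```

Running it once with CHAP enabled on the volume and once without would give the same with/without comparison as steps 9 to 14.
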
Thanks,
Joe

-----Original Message-----
From: email@example.com [mailto:open-is...@googlegroups.com] On Behalf Of Mike Christie
Sent: Saturday, July 11, 2009 7:42 PM
To: firstname.lastname@example.org
Subject: Re: iscsiadm -m iface + routing

On 07/10/2009 01:06 PM, Hoot, Joseph wrote:
> Mike,
>
> I have some more details on this. It seems that a simple `ping -Ieth2
> -i1 192.168.0.19`<-- our group IP to the EqualLogic is able to "reset
> sessions."
>
> eth0 = 1 active nic in the bond (public network)
> eth2 = iface eth2 (192.168.0.151/16)
> eth3 = iface eth3 (192.168.0.161/16)
>
> I slammed the public network for that system from 3 external systems at
> roughly 101MB/s<-- very nicely slammed for gigabit :) with netcat's to
> /dev/null.
>
> I had 8 netcat connections going through public network for about 25
> minutes without a single hiccup (as expected).
>
> For the iSCSI side I had previously done performance testing with dt as
> well as dd with bs=1M and was slamming the EqualLogic storage getting
> around 60MB/s reads on average with (2) systems each having OCFS2 shared
> storage and (2) iSCSI sessions each. Writes were between 30MB/s and
> 155MB/s depending on which EqualLogic array was being hit (SATA vs
> SAS15k respectively). This seemed to work well with a read and a write
> going on simultaneously for about 2 hours.
>
> As soon as I introduce pings:
> [r...@oim6102501 ~]# ping -Ieth2 192.168.0.19& ping -I eth3 192.168.0.19
> [r...@oim6102504 ~]# ping -Ieth2 192.168.0.19& ping -I eth3 192.168.0.19
>
> I receive the following sessions failing, according to the EqualLogic
> INFO 7/10/09 11:02:02 AM
> SATA001 iSCSI session to target '192.168.0.30:3260,
> iqn.2001-05.com.equallogic:0-8a0906-82f16c402-fe30000b33e4a3bc-ovm-1-lun
> 0'
> from initiator '192.168.0.161:45531,
> iqn.1994-05.com.redhat:c79dbacd466' was closed.
> iSCSI initiator connection failure. Reset received on the
> connection.
>
>
> Or according to /var/log/messages on my OVM Server:
>
> Jul 10 11:02:12 oim6102501 kernel: ping timeout of 10 secs expired, last
> rx 16848993, last ping 16851493, now 16852743

The target is getting the errors because the initiator's iscsi pings (nops) are not completing within those noop values I described in the last mail.

I have no idea why a network ping would cause the iscsi ping to fail. Maybe it is causing something to go wrong in the network routing. I really have no idea at this point though. I have never seen this before.

If you were slamming the network while running the iscsi traffic then this could cause the iscsi pings to take longer than noop_out_timeout seconds due to the nop getting stuck behind a long scsi/iscsi command and the non iscsi network test slowing down the iscsi traffic. However, just doing the ping commands above should not cause a problem.

If you turn off nops completely by setting those two noop values to zero:

node.conn.timeo.noop_out_interval = 0
node.conn.timeo.noop_out_timeout = 0

you should not see the ping timeout errors. But do you then see a "Host reset succeeded" message?

One other question. It looks like the iscsi code you are using is from code based on 5.2. There was a bug in there where we would think a ping timed out when it did not. I do not think you are hitting this, but if you could make sure that you are running something based on Red Hat's 5.3 it could rule that out.

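To make the "ping timeout" line quoted above easier to read, here is a minimal Python sketch that pulls out its counters and converts the gaps between them to seconds. It assumes the "last rx", "last ping" and "now" values are kernel jiffies, which is how upstream libiscsi prints them. The tick rate is not in the log line, so `hz` has to be supplied by the caller; the value 250 used below is only an assumption about this kernel's CONFIG_HZ. The function name `ping_timeout_gaps` is hypothetical.

```python
import re

# Layout of the kernel message quoted in this thread.
PING_TIMEOUT = re.compile(
    r"ping timeout of (\d+) secs expired, "
    r"last rx (\d+), last ping (\d+), now (\d+)"
)


def ping_timeout_gaps(line, hz):
    """Convert the jiffies counters of a ping-timeout message to seconds.

    hz is the kernel tick rate (CONFIG_HZ). It is NOT part of the log line,
    so the caller must know or assume it.
    Returns None if the line is not a ping-timeout message.
    """
    match = PING_TIMEOUT.search(line)
    if match is None:
        return None
    timeout, last_rx, last_ping, now = (int(g) for g in match.groups())
    return {
        "timeout_secs": timeout,
        "rx_to_ping_secs": (last_ping - last_rx) / hz,
        "ping_to_now_secs": (now - last_ping) / hz,
        "rx_to_now_secs": (now - last_rx) / hz,
    }


if __name__ == "__main__":
    line = (
        "Jul 10 11:02:12 oim6102501 kernel: ping timeout of 10 secs expired, "
        "last rx 16848993, last ping 16851493, now 16852743"
    )
    # hz=250 is an assumption, not something stated in the thread.
    print(ping_timeout_gaps(line, hz=250))
```

The raw gaps in that line are 2500 jiffies from last rx to last ping and 1250 jiffies from last ping to now. If the kernel really does tick at 250 Hz, that works out to 10 and 5 seconds, or 15 seconds since anything was last received on the connection.
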
> Jul 10 11:02:12 oim6102501 kernel: connection1:0: iscsi: detected conn
> error (1011)
> Jul 10 11:02:12 oim6102501 iscsid: Kernel reported iSCSI connection 1:0
> error (1011) state (3)
> Jul 10 11:02:27 oim6102501 kernel: iscsi: cmd 0x28 is not queued (8)
> Jul 10 11:02:27 oim6102501 kernel: session1: iscsi: session recovery
> timed out after 15 secs
> Jul 10 11:02:27 oim6102501 kernel: sd 5:0:0:0: SCSI error: return code =
> 0x00010000
>
> As soon as I do `killall ping`, within 1 minute the session will
> reconnect and dm-multipath will be happy again.
>
> So I'm wondering two things here:
>
> 1) I looked at the changelog between rpms. I've included them below
> (actually Tom from Oracle did, but I'm just relaying this) and don't see
> any specific bug that talks about the "pdus with cmd sequences out of
> order." I did a google search and found a bunch of changelog info here
> http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.29 but
> couldn't find the specific pdus with cmd sequences. Would you mind
> pointing me to a publicly available bug repo where I can dig further on
> this? Or you if you happen to know the bug number I can do searches on
> that as well.
>

I do not have a red hat bugzilla. Here is the upstream commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=77a23c21aaa723f6b0ffc4a701be8c8e5a32346d

I do not think you are hitting this problem though. If you were you would not see that iscsi ping timeout message.