Mike,

Just as an FYI (in case you were curious about this issue), I've
narrowed it down to something with CHAP.  On my EqualLogic, if I
disable CHAP, I can't reproduce the issue.

After upgrading to the latest OEL 5.3 release of the iscsi-initiator, I
could still reproduce the problem.  Therefore, I did the following:

1) Set up another test environment using the same hardware (physically
different hardware, but all the same firmware, models, etc.)
2) presented a new test volume from EqualLogic
3) ran the ping test (ping -Ieth2 192.168.0.19 & ping -Ieth3
192.168.0.19).
4) I couldn't reproduce the issue.
5) I checked what the differences were: CHAP was the only difference.
6) So I turned on CHAP authentication to the volume.
7) rm -rf /var/lib/iscsi/nodes/* /var/lib/iscsi/send_targets/*
8) rediscovered targets (after modifying /etc/iscsi/iscsid.conf with
CHAP information)

node.session.auth.authmethod = CHAP
node.session.auth.username = mychapuserhere
node.session.auth.password = mypasshere

9) ran the same ping test and was able to get iscsi sessions to fail
within 2 minutes.
10) I wanted to prove that CHAP was the issue, so I logged out of all
iscsi sessions.
11) I disabled CHAP on the EqualLogic
12) rediscovered targets and logged back in to the sessions (without CHAP
authentication)
13) ran the ping tests and couldn't break it after 30 minutes.
14) added CHAP again and was able to break the sessions within 2
minutes.
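
When flipping CHAP on and off between runs, it helps to record which state the
config file was in for each test.  Below is a minimal sketch that parses
iscsid.conf-style "name = value" text and reports whether CHAP is selected.
The function names are hypothetical, and treating a missing authmethod as
"None" is my assumption.  It only inspects the text it is given; it does not
look at the node records that discovery creates from these settings.

```python
def parse_iscsid_settings(text):
    """Parse 'name = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values containing '=' stay intact.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def chap_enabled(settings):
    """True if session authentication is set to CHAP.

    ASSUMPTION: an absent authmethod is treated as "None" (no authentication).
    """
    method = settings.get("node.session.auth.authmethod", "None")
    return method.upper() == "CHAP"


if __name__ == "__main__":
    sample = """
    node.session.auth.authmethod = CHAP
    node.session.auth.username = mychapuserhere
    node.session.auth.password = mypasshere
    """
    print(chap_enabled(parse_iscsid_settings(sample)))
```

Logging this result next to the time-to-failure of each ping test would make
the CHAP / no-CHAP comparison above easier to repeat.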

So definitely something odd with CHAP (my guess, either in open-iscsi
code or EqualLogic code).  I've asked Roger Lopez, from Dell, to attempt
to reproduce this in his lab.  He has EqualLogic and Oracle VM Servers.
Oracle developers that I'm working with don't currently have access to
an EqualLogic, but they are attempting to reproduce this with their
iSCSI equipment as well.  I'm going to set up port mirroring on our
switch and run tcpdumps to see what I can get.

Thanks for all your help so far.  Everyone that I've been working with on
this issue has been a great help!

Thanks,
Joe

-----Original Message-----
From: open-iscsi@googlegroups.com [mailto:open-is...@googlegroups.com]
On Behalf Of Mike Christie
Sent: Saturday, July 11, 2009 7:42 PM
To: open-iscsi@googlegroups.com
Subject: Re: iscsiadm -m iface + routing


On 07/10/2009 01:06 PM, Hoot, Joseph wrote:
> Mike,
>
> I have some more details on this.  It seems that a simple `ping -Ieth2
> -i1 192.168.0.19`<-- our group IP to the EqualLogic is able to "reset
> sessions."
>
> eth0 = 1 active nic in the bond (public network)
> eth2 = iface eth2 (192.168.0.151/16)
> eth3 = iface eth3 (192.168.0.161/16)
>
> I slammed the public network for that system from 3 external systems at
> roughly 101MB/s<-- very nicely slammed for gigabit :) with netcat's to
> /dev/null.
>
> I had 8 netcat connections going through public network for about 25
> minutes without a single hiccup (as expected).
>
> For the iSCSI side I had previously done performance testing with dt as
> well as dd with bs=1M and was slamming the EqualLogic storage getting
> around 60MB/s reads on average with (2) systems each having OCFS2 shared
> storage and (2) iSCSI sessions each.  Writes were between 30MB/s and
> 155MB/s depending on which EqualLogic array was being hit (SATA vs
> SAS15k respectively).  This seemed to work well with a read and a write
> going on simultaneously for about 2 hours.
>
> As soon as I introduce pings:
> [r...@oim6102501 ~]# ping -Ieth2 192.168.0.19&  ping -I eth3 192.168.0.19
> [r...@oim6102504 ~]# ping -Ieth2 192.168.0.19&  ping -I eth3 192.168.0.19
>
> I receive the following sessions failing, according to the EqualLogic
> INFO  7/10/09  11:02:02 AM
>     SATA001  iSCSI session to target '192.168.0.30:3260,
> iqn.2001-05.com.equallogic:0-8a0906-82f16c402-fe30000b33e4a3bc-ovm-1-lun0'
>     from initiator '192.168.0.161:45531,
> iqn.1994-05.com.redhat:c79dbacd466' was closed.
>     iSCSI initiator connection failure.   Reset received on the
> connection.
>
>
> Or according to /var/log/messages on my OVM Server:
>
> Jul 10 11:02:12 oim6102501 kernel: ping timeout of 10 secs expired, last rx 16848993, last ping 16851493, now 16852743


The target is getting the errors because the initiator's iscsi pings 
(nops) are not completing within those noop values I described in the 
last mail.
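
To line the "ping timeout" messages up against the noop settings, the counters
in the log line can be turned into elapsed seconds.  This is only a sketch:
it assumes the three counters are kernel ticks (jiffies), and the tick rate
`hz` has to be supplied by whoever runs it, since it depends on how the kernel
was built.  The function and field names are made up for illustration.

```python
import re

# Matches the kernel message quoted above; group names are my own labels.
PING_TIMEOUT_RE = re.compile(
    r"ping timeout of (?P<timeout>\d+) secs expired, "
    r"last rx (?P<last_rx>\d+), last ping (?P<last_ping>\d+), "
    r"now (?P<now>\d+)"
)


def parse_ping_timeout(line, hz):
    """Return timing deltas in seconds, or None if the line does not match.

    ASSUMPTION: the counters are jiffies, so dividing by hz (ticks per
    second for the running kernel) gives seconds.
    """
    # Mail clients re-wrap long log lines; collapse whitespace before matching.
    m = PING_TIMEOUT_RE.search(" ".join(line.split()))
    if m is None:
        return None
    last_rx = int(m.group("last_rx"))
    last_ping = int(m.group("last_ping"))
    now = int(m.group("now"))
    return {
        "timeout_secs": int(m.group("timeout")),
        "since_last_rx": (now - last_rx) / hz,
        "since_last_ping": (now - last_ping) / hz,
    }


if __name__ == "__main__":
    msg = ("Jul 10 11:02:12 oim6102501 kernel: ping timeout of 10 secs "
           "expired, last rx 16848993, last ping 16851493, now 16852743")
    # hz=250 is an illustrative guess, not a value taken from this system.
    print(parse_ping_timeout(msg, hz=250))
```

Running this over every such line in /var/log/messages would show whether the
gap since the last received data grows steadily while the network pings are
running, or jumps all at once.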

I have no idea why a network ping would cause the iscsi ping to fail. 
Maybe it is causing something to go wrong in the network routing. I 
really have no idea at this point though. I have never seen this before.

If you were slamming the network while running the iscsi traffic then 
this could cause the iscsi pings to take longer than noop_timeout 
seconds due to the nop getting stuck behind a long scsi/iscsi command 
and the non iscsi network test slowing down the iscsi traffic. However, 
just doing the ping commands above should not cause a problem.

If you turn off nops completely by setting those two noop values to
zero:
node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0

you should not see the ping timeout errors. But do you then see a "Host 
reset succeeded" message?


One other question. It looks like the iscsi code you are using is from 
code based on 5.2. There was a bug in there where we would think a ping 
timed out when it did not. I do not think you are hitting this, but if 
you could make sure that you are running something based on Red Hat's 
5.1 it could rule that out.


> Jul 10 11:02:12 oim6102501 kernel:  connection1:0: iscsi: detected conn error (1011)
> Jul 10 11:02:12 oim6102501 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
> Jul 10 11:02:27 oim6102501 kernel: iscsi: cmd 0x28 is not queued (8)
> Jul 10 11:02:27 oim6102501 kernel:  session1: iscsi: session recovery timed out after 15 secs
> Jul 10 11:02:27 oim6102501 kernel: sd 5:0:0:0: SCSI error: return code = 0x00010000
>
> As soon as I do `killall ping`, within 1 minute the session will
> reconnect and dm-multipath will be happy again.
>
> So I'm wondering two things here:
>
> 1) I looked at the changelog between rpms.  I've included them below
> (actually Tom from Oracle did, but I'm just relaying this) and don't see
> any specific bug that talks about the "pdus with cmd sequences out of
> order."  I did a google search and found a bunch of changelog info here
> http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.29 but
> couldn't find the specific pdus with cmd sequences.  Would you mind
> pointing me to a publicly available bug repo where I can dig further on
> this?  Or if you happen to know the bug number I can do searches on
> that as well.
>

I do not have a red hat bugzilla. Here is the upstream commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=77a23c21aaa723f6b0ffc4a701be8c8e5a32346d

I do not think you are hitting this problem though. If you were you 
would not see that iscsi ping timeout message.


