bigcatxjs wrote:
> UPDATE: RHEL 5.3 Host is showing errors. No Disk I/O to SAN volume
> (last I/O Thursday 12th March);
>

Is there anything in the log before this? Something about a ping or nop
timing out?
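If it is handy, something like the grep below should pull out any nop/ping
messages logged just before the conn error -- only a rough sketch, using the
standard RHEL log path:

    # look for iSCSI nop/ping and connection messages around the failure
    grep -iE 'iscsi|iscsid|nop|ping timeout' /var/log/messages | less

    # or just show some context either side of the first conn error
    grep -B 20 -A 5 'detected conn error (1011)' /var/log/messages | less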
> Mar 13 10:38:49 MYHOST53 kernel: connection1:0: iscsi: detected conn
> error (1011)
> Mar 13 10:38:49 MYHOST53 iscsid: Kernel reported iSCSI connection 1:0
> error (1011) state (3)
> Mar 13 10:38:52 MYHOST53 iscsid: received iferror -38
> Mar 13 10:38:52 MYHOST53 last message repeated 2 times
> Mar 13 10:38:52 MYHOST53 iscsid: connection1:0 is operational after
> recovery (1 attempts)
> Mar 13 11:00:06 MYHOST53 kernel: connection1:0: iscsi: detected conn
> error (1011)
> Mar 13 11:00:06 MYHOST53 iscsid: Kernel reported iSCSI connection 1:0
> error (1011) state (3)
> Mar 13 11:00:09 MYHOST53 iscsid: received iferror -38
> Mar 13 11:00:09 MYHOST53 last message repeated 2 times
> Mar 13 11:00:09 MYHOST53 iscsid: connection1:0 is operational after
> recovery (1 attempts)
>
> Thanks, Rich.
>
> END.
>
> On Mar 13, 10:01 am, bigcatxjs <ad...@richardjamestrading.co.uk>
> wrote:
>> Thanks Mike,
>>
>>> For this RHEL 5.2 setup, does it make a difference if you do not use
>>> ifaces and setup the box like in 5.3 below?
>> I have used bonded ifaces so that the I/O requests can be split across
>> multiple NICs (both server-side and on the Datacore San Melody SM node
>> NICs).  This split is achieved by ensuring that the volumes used by
>> Oracle containing DATA and INDEX datafiles route through one named
>> iface and that volumes used by Oracle for SYSTEM, BACKUP, and REDO
>> data/logs etc. route through the other.  We have seen a performance
>> uplift by maintaining this split despite the time-out issues.  We have
>> a W2K3 x86_64 STD Oracle host that runs on one iface - this is much
>> slower than the RHEL 5.2 x86_64 host even though the hardware is
>> identical.  We did have RHEL 5.1 x86_64 Oracle hosts running on one
>> iface - again, this was noticeably slower than the bonded ifaces
>> approach.  These have since been upgraded to RHEL 5.2 with the
>> multiple ifaces.
>>
>>> There was a bug in 5.2 where the initiator would think it detected a
>>> timeout when it did not. It is fixed in 5.3.
>> Good.  Then I should expect to see fewer errors.
>>
>>> The messages can also occur when there really is a problem with the
>>> network or if the target is bogged down.
>> We have spread the primary volumes across both SM nodes.  The nodes
>> are W2K3 x86 (no x64 option for the DataCore software) Dell 2850s.
>> There are two switches (one for SM1, one for SM2) that are linked
>> using teamed Fibre (2GB/sec capacity), so I/O should route evenly
>> across both switches.  The SM mirroring takes advantage of the Fibre.
>> With the RHEL 5.2 host, you will note that both ifaces are going to
>> the SM2 node, but utilising different NICs on the SM2 node.  These
>> volumes are then mirrored to SM1 (except the BACKUP volume, which is
>> a linear volume).  We know that the switches aren't congested, but we
>> don't accurately know whether SM1 or SM2 is congested.  We only have
>> a logical spread of volumes presented across multiple NICs to at
>> least try to minimise congestion.
>>
>>> At these times is there lots of disk IO? Is there anything in the target
>>> logs?
>> It is fair to say that all these volumes take a heavy hit in terms of
>> I/O.  Each host (excluding the RHEL 5.3 test host) runs two Oracle
>> databases, some of which have intra-database replication (Oracle
>> Streams) enabled.  The issue on the RHEL 5.2 host occurs every 10
>> secs or so during Office Hours when it is being utilised.
>>
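To put a number on that "heavy hit", running sysstat's iostat on the host
during office hours would show how busy the iSCSI LUNs actually are when the
errors appear -- a rough sketch only; watch the lines for the sdb/sde/sdf
devices shown in the session dump further down:

    # extended per-device statistics every 5 seconds
    # watch await and %util on the iSCSI disks (sdb, sde, sdf below)
    iostat -x -k 5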
>>> So the RHEL5.3 box is having troubles too?  There is nothing in the
>>> log below.
>> The error with the RHEL 5.3 host was as follows:
>>
>>> Mar 11 18:12:03 MYHOST53 iscsid: received iferror -38
>>> Mar 11 18:12:03 MYHOST53 last message repeated 2 times
>>> Mar 11 18:12:03 MYHOST53 iscsid: connection1:0 is operational now
>> This looked similar to the previous RHEL 5.2 errors.
>>
>>> Can you replicate this pretty easily? If you just login the session,
>>> then let it sit (do not run the db or any disk IO), will you see the
>>> ping timeout errors?
>> I can test this with the RHEL 5.3 host.  Unfortunately, it will be
>> difficult to take down the RHEL 5.2 host's database services until we
>> have a scheduled outage window.
>>
>> Today, there have been no further errors on the RHEL 5.3 host :>).
>>
>>> It might be helpful to run ethereal/wireshark while you run your test
>>> then send the /var/log/messages and trace so I can check and see if the
>>> ping is really timing out or not. For the test you only need one session
>>> logged in (this will reduce log and trace info), and once you see the
>>> first ping timeout error you can stop tracing/logging and send it.
>> Yes; there is also an Oracle tool (Orion) that we could use.
>>
>> I think that I will monitor the RHEL 5.3 host for any further errors.
>> If the incidence of errors is reduced, then that gives justification
>> for upgrading the RHEL 5.2 host to 5.3.  Such an outage would provide
>> me with an opportunity to perform the tests above as well.
>>
>> Many thanks,
>> Richard.
>>
>> END.
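For the ethereal/wireshark test mentioned above, a capture limited to the
iSCSI port keeps the trace small -- a rough sketch, assuming the test session
runs over eth2; the interface name and output filename are only placeholders:

    # capture full iSCSI frames on the initiator NIC carrying the test session
    tcpdump -i eth2 -s 0 -w iscsi-ping-timeout.pcap port 3260

Stop it with Ctrl-C once the first ping timeout message shows up in
/var/log/messages; the .pcap can then be opened in ethereal/wireshark and sent
along with the log.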
>> With the RHEL 5.2 host
>>
>> On Mar 12, 5:53 pm, Mike Christie <micha...@cs.wisc.edu> wrote:
>>
>>> bigcatxjs wrote:
>>> For this RHEL 5.2 setup, does it make a difference if you do not use
>>> ifaces and setup the box like in 5.3 below?
>>>> iscsiadm:
>>>> iSCSI Transport Class version 2.0-724
>>>> iscsiadm version 2.0-868
>>>> Target: iqn.2000-08.com.datacore:sm2-3
>>>> Current Portal: 172.16.200.9:3260,1
>>>> Persistent Portal: 172.16.200.9:3260,1
>>>> **********
>>>> Interface:
>>>> **********
>>>> Iface Name: iface0
>>>> Iface Transport: tcp
>>>> Iface Initiatorname: iqn.1994-05.com.redhat:7fe2f44ea9de
>>>> Iface IPaddress: 172.16.200.39
>>>> Iface HWaddress: 00:14:22:0d:0a:fa
>>>> Iface Netdev: default
>>>> SID: 1
>>>> iSCSI Connection State: LOGGED IN
>>>> iSCSI Session State: Unknown
>>>> Internal iscsid Session State: NO CHANGE
>>>> ************************
>>>> Negotiated iSCSI params:
>>>> ************************
>>>> HeaderDigest: None
>>>> DataDigest: None
>>>> MaxRecvDataSegmentLength: 131072
>>>> MaxXmitDataSegmentLength: 262144
>>>> FirstBurstLength: 0
>>>> MaxBurstLength: 1048576
>>>> ImmediateData: No
>>>> InitialR2T: Yes
>>>> MaxOutstandingR2T: 1
>>>> ************************
>>>> Attached SCSI devices:
>>>> ************************
>>>> Host Number: 1 State: running
>>>> scsi1 Channel 00 Id 0 Lun: 0
>>>> Attached scsi disk sdb State: running
>>>> scsi1 Channel 00 Id 0 Lun: 1
>>>> Attached scsi disk sde State: running
>>>> scsi1 Channel 00 Id 0 Lun: 2
>>>> Attached scsi disk sdf State: running
>>>> Target: iqn.2000-08.com.datacore:sm2-4
>>>> Current Portal: 172.16.200.10:3260,1
>>>> Persistent Portal: 172.16.200.10:3260,1
>>>> **********
>>>> Interface:
>>>> **********
>>>> Iface Name: iface2
>>>> Iface Transport: tcp
>>>> Iface Initiatorname: iqn.1994-05.com.redhat:7fe2f44ea9de
>>>> Iface IPaddress: 172.16.200.56
>>>> Iface HWaddress: 00:14:22:b1:d6:a6
>>>> Iface Netdev: default
>>>> SID: 2
>>>> iSCSI Connection State: LOGGED IN
>>>> iSCSI Session State: Unknown
>>>> Internal iscsid Session State: NO CHANGE
>>>> ************************
>>>> Negotiated iSCSI params:
>>>> ************************
>>>> HeaderDigest: None
>>>> DataDigest: None
>>>> MaxRecvDataSegmentLength: 131072
>>>> MaxXmitDataSegmentLength: 262144
>>>> FirstBurstLength: 0
>>>> MaxBurstLength: 1048576
>>>> ImmediateData: No
>>>> InitialR2T: Yes
>>>> MaxOutstandingR2T: 1
>>>> ************************
>>>> Attached SCSI devices:
>>>> ************************
>>>> Host Number: 2 State: running
>>>> scsi2 Channel 00 Id 0 Lun: 0
>>>> Attached scsi disk sdc State: running
>>>> scsi2 Channel 00 Id 0 Lun: 1
>>>> Attached scsi disk sdd State: running
>>>>
>>>> Log Errors:
>>>> Mar 12 09:30:48 MYHOST52 last message repeated 2 times
>>>> Mar 12 09:30:48 MYHOST52 iscsid: connection2:0 is operational after
>>>> recovery (1 attempts)
>>>> Mar 12 09:32:52 MYHOST52 kernel: ping timeout of 5 secs expired, last
>>>> rx 19592296349, last ping 19592301349, now 19592306349
>>> There was a bug in 5.2 where the initiator would think it detected a
>>> timeout when it did not. It is fixed in 5.3.
>>> The messages can also occur when there really is a problem with the
>>> network or if the target is bogged down.
>>> At these times is there lots of disk IO? Is there anything in the target
>>> logs?
>>> I am also not sure how well some targets handle bonding plus ifaces. Is
>>> iface* using a bonded interface?
>>> Can you replicate this pretty easily? If you just login the session,
>>> then let it sit (do not run the db or any disk IO), will you see the
>>> ping timeout errors?
>>> It might be helpful to run ethereal/wireshark while you run your test
>>> then send the /var/log/messages and trace so I can check and see if the
>>> ping is really timing out or not. For the test you only need one session
>>> logged in (this will reduce log and trace info), and once you see the
>>> first ping timeout error you can stop tracing/logging and send it.
>>>> From RHEL 5.3 x86 Host;
>>> So the RHEL5.3 box is having troubles too? There is nothing in the log
>>> below.
>>>> iscsiadm;
>>>> iSCSI Transport Class version 2.0-724
>>>> iscsiadm version 2.0-868
>>>> Target: iqn.2000-08.com.datacore:sm2-3
>>>> Current Portal: 172.16.200.9:3260,1
>>>> Persistent Portal: 172.16.200.9:3260,1
>>>> **********
>>>> Interface:
>>>> **********
>>>> Iface Name: default
>>>> Iface Transport: tcp
>>>> Iface Initiatorname: iqn.2005-03.com.redhat:01.406e5fd710e2
>>>> Iface IPaddress: 172.16.200.69
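As a footnote: if the nops really are timing out under load (rather than the
5.2 false-timeout bug), the 5 seconds in the "ping timeout of 5 secs expired"
message should correspond to the nop-out timeout, which can be relaxed -- a
rough sketch only, the values below are just examples, and a session has to
be logged out and back in for node settings to take effect:

    # defaults in /etc/iscsi/iscsid.conf (picked up by node records created later)
    node.conn[0].timeo.noop_out_interval = 10
    node.conn[0].timeo.noop_out_timeout = 30

    # or update an existing node record, e.g. the sm2-3 target above
    iscsiadm -m node -T iqn.2000-08.com.datacore:sm2-3 -p 172.16.200.9:3260 \
        --op update -n node.conn[0].timeo.noop_out_timeout -v 30

That only papers over the symptom, though; it is still worth confirming with
the trace whether the network or the target is the real bottleneck.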