----- "Mike Christie" <[email protected]> wrote:
> On 02/04/2010 06:57 AM, --[ UxBoD ]-- wrote:
> > Hi,
> >
> > we have recently experienced a problem where running mkfs.ext4 on a
> > freshly presented LUN spirals the CPU out of control and makes all the
> > KVM guests on the same host freeze.
> >
> > On checking /var/log/messages when it happens I see:
> >
> > ------------------------------------------/ snip /--------------------------------------------------------
> > Feb 4 01:29:09 kvm01 kernel: connection1:0: ping timeout of 3 secs expired, last rx 4295304767, last ping 4295307767, now 4295310767
> > Feb 4 01:29:09 kvm01 kernel: connection1:0: detected conn error (1011)
> > Feb 4 01:29:10 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
>
>
>
> > Feb 4 01:30:06 kvm01 kernel: BUG: soft lockup - CPU#5 stuck for 61s! [kblockd/5:280]
> > Feb 4 01:30:52 kvm01 iscsid: connection1:0 is operational after recovery (1 attempts)
>
>
> Does the guest have one CPU? How many does the host have?
>
> After you saw the operational after recovery message, did the cpu use go
> back to normal, and did the other guests unfreeze?
>
>
> The ping timeout message could indicate a temporary network problem, or
> it could be a bug where we were firing the error when the ping timed out
> but other IO was making progress (the ping just got stuck behind other,
> larger IOs). That is fixed in 2.6.31.
>
> The ping timeout causes the initiator to drop the session, which is why
> you see the detected conn error messages and, later, the operational
> after recovery message. Between those messages the iscsi layer tells the
> scsi layer not to send it any IO, since it is trying to reconnect to the
> target and cannot execute anything.
>
> It looks like we had the scsi layer stop IO, and the soft lockup code
> spat out an error. I have to look at that more closely, because if that
> process is stuck in scsi_request_fn (maybe on a spin lock), that would
> be why we see cpu usage go so high and nothing happening. But we should
> not be stuck on a spin lock in there.
>
> Is this something you can easily replicate?
>
>
> Also, unless you are using dm-multipath or doing some sort of clustering
> that requires it, a ping timeout of 3 secs is really short. I would bump
> it to more like 15 or 10 (maybe node.conn[0].timeo.noop_out_interval
> should be 15 and node.conn[0].timeo.noop_out_timeout 10).
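> In iscsid.conf that would look like the lines below (a sketch with the
> values suggested above; this only applies to node records created after
> the change, so existing nodes would need an iscsiadm -o update or
> re-discovery to pick it up):
>
> ```
> # /etc/iscsi/iscsid.conf -- noop-out (iSCSI ping) timers
> node.conn[0].timeo.noop_out_interval = 15
> node.conn[0].timeo.noop_out_timeout = 10
> ```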
>
>
>
>
> > Feb 4 01:30:57 kvm01 kernel: connection1:0: ping timeout of 3 secs expired, last rx 4295413014, last ping 4295416014, now 4295419014
> > Feb 4 01:30:57 kvm01 kernel: connection1:0: detected conn error (1011)
> > Feb 4 01:30:58 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
> > Feb 4 01:31:23 kvm01 kernel: connection1:0: detected conn error (1019)
>
> This one is pretty rare. tcp_sendpage returned an error.
>
> > Feb 4 01:31:23 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1019) state (1)
> > Feb 4 01:31:27 kvm01 iscsid: connection1:0 is operational after recovery (1 attempts)
> > ------------------------------------------/ snip /--------------------------------------------------------
> >
> > This is with kernel and initiator tools:
> >
> > Linux kvm01.xxxxxxxx.xxx 2.6.29.1 #1 SMP Sat Apr 11 20:03:55 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
> > iscsi-initiator-utils-6.2.0.871-0.10.el5
> >
> > Prior to the mkfs the host was running fine with four other guests on
> > multiple iSCSI LUNs.
> >
> > Having checked the network switch all looks pretty good. Would be
> > grateful for some advice on how to track down the issue.
Hi Mike,
Thank you for the quick response.
The server has 8 CPUs in total, and we are using an Intel quad-port card and
the e1000 driver with dm-multipath. I have just run another test using dd and
it exhibited the same problem. I managed to capture an iscsiadm -m session -P 3
before I had to kill off the dd:
Target: iqn.1986-03.com.sun:02:kvm01
    Current Portal: 172.30.13.78:3260,2
    Persistent Portal: 172.30.13.78:3260,2
        **********
        Interface:
        **********
        Iface Name: default
        Iface Transport: tcp
        Iface Initiatorname: iqn.2008-05.xxxxxxxxxxxxxxx:kvm01-general
        Iface IPaddress: 172.30.13.67
        Iface HWaddress: default
        Iface Netdev: default
        SID: 1
        iSCSI Connection State: TRANSPORT WAIT
        iSCSI Session State: FAILED
        Internal iscsid Session State: REOPEN
        ************************
        Negotiated iSCSI params:
        ************************
        HeaderDigest: None
        DataDigest: None
        MaxRecvDataSegmentLength: 131072
        MaxXmitDataSegmentLength: 65536
        FirstBurstLength: 65536
        MaxBurstLength: 524288
        ImmediateData: Yes
        InitialR2T: Yes
        MaxOutstandingR2T: 1
        ************************
        Attached SCSI devices:
        ************************
        Host Number: 8  State: running
        scsi8 Channel 00 Id 0 Lun: 0
            Attached scsi disk sdc  State: blocked
        scsi8 Channel 00 Id 0 Lun: 1
            Attached scsi disk sdd  State: blocked
        scsi8 Channel 00 Id 0 Lun: 2
            Attached scsi disk sde  State: blocked
        scsi8 Channel 00 Id 0 Lun: 3
            Attached scsi disk sdf  State: blocked
        scsi8 Channel 00 Id 0 Lun: 4
            Attached scsi disk sdg  State: blocked
        scsi8 Channel 00 Id 0 Lun: 5
            Attached scsi disk sdh  State: blocked
        scsi8 Channel 00 Id 0 Lun: 6
            Attached scsi disk sdi  State: blocked
        scsi8 Channel 00 Id 0 Lun: 7
            Attached scsi disk sdj  State: blocked
The other three ports were fine and accepted requests. As soon as I killed off
the dd, CPU usage returned to normal and the guests were responsive again.
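For reference, the dd load was along these lines (a sketch only: on the host
the dd ran against the iSCSI LUN itself, not a file; a scratch file and a
small size stand in below so the sketch is harmless to run anywhere):

```shell
# Stand-in for the sequential write load that reproduced the hang.
# On the real host the target was the LUN block device (one of the
# blocked sdc-sdj disks, or its dm-multipath alias) -- the path below
# is a safe placeholder.
TARGET=/tmp/iscsi-write-load.img

# 64 sequential 1 MiB writes, flushed to stable storage at the end --
# similar in shape to the streaming writes mkfs.ext4 issues on a fresh LUN.
dd if=/dev/zero of="$TARGET" bs=1M count=64 conv=fsync 2>/dev/null

stat -c %s "$TARGET"   # prints 67108864 (64 MiB)
```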
--
Thanks, Phil
--
You received this message because you are subscribed to the Google Groups
"open-iscsi" group.