----- "Mike Christie" <[email protected]> wrote:

> On 02/04/2010 06:57 AM, --[ UxBoD ]-- wrote:
> > Hi,
> >
> > we have recently experienced a problem where running mkfs.ext4 on a
> > freshly presented LUN spirals the CPU out of control and makes all the
> > KVM guests on the same host freeze.
> >
> > On checking /var/log/messages when it happens I see:
> >
> > ------------------------------------------/ snip /--------------------------------------------------------
> > Feb  4 01:29:09 kvm01 kernel:  connection1:0: ping timeout of 3 secs expired, last rx 4295304767, last ping 4295307767, now 4295310767
> > Feb  4 01:29:09 kvm01 kernel:  connection1:0: detected conn error (1011)
> > Feb  4 01:29:10 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
> 
> 
> 
> > Feb  4 01:30:06 kvm01 kernel: BUG: soft lockup - CPU#5 stuck for 61s! [kblockd/5:280]
> > Feb  4 01:30:52 kvm01 iscsid: connection1:0 is operational after recovery (1 attempts)
> 
> 
> Does the guest have one CPU?  How many does the host have?
> 
> After you saw the operational after recovery message, did the CPU usage
> go back to normal, and did the other guests unfreeze?
> 
> 
> The ping timeout message could indicate a temporary network problem, or
> it could be a bug where we were firing the error when the ping timed out
> but other IO was making progress (the ping just got stuck behind other
> larger IOs). That is fixed in 2.6.31.
> 
> The ping timeout would cause the initiator to drop the session, and so
> you see the detected conn error messages and later the operational
> after recovery message. Between those messages, the iscsi layer tells
> the scsi layer not to send it any IO, since it is trying to reconnect to
> the target and cannot execute anything.
> 
> It looks like we had the scsi layer stop IO and the soft lockup code
> spit out an error. I have to look at that more closely, because if that
> process is stuck in scsi_request_fn (maybe on a spin lock), then that
> would be why we see cpu usage go so high and nothing happening. But we
> should not be stuck on a spin lock in there.
> 
> Is this something you can easily replicate?
> 
> 
> Also, unless you are using dm-multipath or doing some sort of clustering
> that requires it, a ping timeout of 3 secs is really short. I would bump
> it to more like 15 or 10 (maybe the node.conn[0].timeo.noop_out_interval
> should be 15 and the node.conn[0].timeo.noop_out_timeout 10).
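
For reference, a minimal sketch of how the two settings named above could be applied to one node record. It only builds and prints the iscsiadm command lines rather than running them. The helper name noop_update_commands is made up for illustration; the target and portal are the ones shown in the session output later in this message. The 15/10 values are the ones suggested above and may not be appropriate on a dm-multipath setup, where short timeouts are often deliberate.

```python
import shlex


def noop_update_commands(target, portal, interval=15, timeout=10):
    """Build iscsiadm commands that update the NOP-Out settings of one node record.

    The helper itself is hypothetical; it only assembles strings and does not
    touch the iSCSI configuration.
    """
    settings = {
        "node.conn[0].timeo.noop_out_interval": interval,
        "node.conn[0].timeo.noop_out_timeout": timeout,
    }
    commands = []
    for name, value in settings.items():
        argv = [
            "iscsiadm", "-m", "node",
            "-T", target, "-p", portal,
            "-o", "update", "-n", name, "-v", str(value),
        ]
        # Quote each argument so the brackets in the setting name survive the shell.
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


if __name__ == "__main__":
    for command in noop_update_commands(
        "iqn.1986-03.com.sun:02:kvm01", "172.30.13.78:3260"
    ):
        print(command)
```

Changes to a node record normally take effect only after the session has been logged out and back in, so the printed commands would need to be followed by a re-login on that path.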
> 
> 
> 
> 
> > Feb  4 01:30:57 kvm01 kernel:  connection1:0: ping timeout of 3 secs expired, last rx 4295413014, last ping 4295416014, now 4295419014
> > Feb  4 01:30:57 kvm01 kernel:  connection1:0: detected conn error (1011)
> > Feb  4 01:30:58 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
> > Feb  4 01:31:23 kvm01 kernel:  connection1:0: detected conn error (1019)
> 
> This one is pretty rare. tcp_sendpage returned an error.
> 
> > Feb  4 01:31:23 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1019) state (1)
> > Feb  4 01:31:27 kvm01 iscsid: connection1:0 is operational after recovery (1 attempts)
> > ------------------------------------------/ snip /--------------------------------------------------------
> >
> > This is with kernel and initiator tools:
> >
> > Linux kvm01.xxxxxxxx.xxx 2.6.29.1 #1 SMP Sat Apr 11 20:03:55 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
> > iscsi-initiator-utils-6.2.0.871-0.10.el5
> >
> > Prior to the mkfs the host was running fine with four other guests
> > on multiple iSCSI LUNs.
> >
> > Having checked the network switch, all looks pretty good.  I would be
> > grateful for some advice on how to track down the issue.
Hi Mike,

thank you for the quick response.

The server has 8 CPUs in total and we are using an Intel quad-port card with the
e1000 driver and dm-multipath.  I have just run another test using dd and it
exhibited the same problem.  I managed to capture the output of
iscsiadm -m session -P 3 before I had to kill off the dd:

Target: iqn.1986-03.com.sun:02:kvm01
        Current Portal: 172.30.13.78:3260,2
        Persistent Portal: 172.30.13.78:3260,2
                **********
                Interface:
                **********
                Iface Name: default
                Iface Transport: tcp
                Iface Initiatorname: iqn.2008-05.xxxxxxxxxxxxxxx:kvm01-general
                Iface IPaddress: 172.30.13.67
                Iface HWaddress: default
                Iface Netdev: default
                SID: 1
                iSCSI Connection State: TRANSPORT WAIT
                iSCSI Session State: FAILED
                Internal iscsid Session State: REPOEN
                ************************
                Negotiated iSCSI params:
                ************************
                HeaderDigest: None
                DataDigest: None
                MaxRecvDataSegmentLength: 131072
                MaxXmitDataSegmentLength: 65536
                FirstBurstLength: 65536
                MaxBurstLength: 524288
                ImmediateData: Yes
                InitialR2T: Yes
                MaxOutstandingR2T: 1
                ************************
                Attached SCSI devices:
                ************************
                Host Number: 8  State: running
                scsi8 Channel 00 Id 0 Lun: 0
                        Attached scsi disk sdc          State: blocked
                scsi8 Channel 00 Id 0 Lun: 1
                        Attached scsi disk sdd          State: blocked
                scsi8 Channel 00 Id 0 Lun: 2
                        Attached scsi disk sde          State: blocked
                scsi8 Channel 00 Id 0 Lun: 3
                        Attached scsi disk sdf          State: blocked
                scsi8 Channel 00 Id 0 Lun: 4
                        Attached scsi disk sdg          State: blocked
                scsi8 Channel 00 Id 0 Lun: 5
                        Attached scsi disk sdh          State: blocked
                scsi8 Channel 00 Id 0 Lun: 6
                        Attached scsi disk sdi          State: blocked
                scsi8 Channel 00 Id 0 Lun: 7
                        Attached scsi disk sdj          State: blocked

The other three ports were fine and accepted requests.  As soon as I killed off
the dd, CPU usage returned to normal and the guests were responsive again.
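
As a cross-check on the ping timeout lines quoted above, here is a small sketch that pulls the three counters out of such a line and converts the gaps to seconds. It assumes the counters are jiffies and that the kernel runs at HZ=1000; that is my assumption, although it is consistent with the "3 secs" in the message. The function name ping_gaps is made up for illustration.

```python
import re

# Matches the kernel's "ping timeout ... expired" message as it appears in the log.
PING_RE = re.compile(
    r"ping timeout of (?P<secs>\d+) secs expired, "
    r"last rx (?P<last_rx>\d+), last ping (?P<last_ping>\d+), now (?P<now>\d+)"
)


def ping_gaps(line, hz=1000):
    """Return the gaps between the counters in seconds, or None if no match.

    hz is an assumed tick rate (jiffies per second), not something read
    from the log.
    """
    match = PING_RE.search(line)
    if match is None:
        return None
    secs, last_rx, last_ping, now = (
        int(match.group(key)) for key in ("secs", "last_rx", "last_ping", "now")
    )
    return {
        "timeout_secs": secs,
        "rx_to_ping_secs": (last_ping - last_rx) / hz,
        "ping_to_now_secs": (now - last_ping) / hz,
    }


if __name__ == "__main__":
    line = (
        "Feb  4 01:29:09 kvm01 kernel:  connection1:0: ping timeout of 3 secs "
        "expired, last rx 4295304767, last ping 4295307767, now 4295310767"
    )
    print(ping_gaps(line))
```

Under that assumption both gaps come out at 3.0 seconds for each of the two ping timeout lines in the log: the connection had been idle for 3 seconds when the ping was sent, and the ping then went unanswered for another 3 seconds.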

-- 
Thanks, Phil
