On 02/04/2010 06:57 AM, --[ UxBoD ]-- wrote:
Hi,

we have recently experienced a problem where running mkfs.ext4 on a freshly 
presented LUN spirals the CPU out of control and makes all the KVM guests on 
the same host freeze.

On checking /var/log/messages when it happens I see:

------------------------------------------/ snip 
/--------------------------------------------------------
Feb  4 01:29:09 kvm01 kernel:  connection1:0: ping timeout of 3 secs expired, 
last rx 4295304767, last ping 4295307767, now 4295310767
Feb  4 01:29:09 kvm01 kernel:  connection1:0: detected conn error (1011)
Feb  4 01:29:10 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) 
state (3)



Feb  4 01:30:06 kvm01 kernel: BUG: soft lockup - CPU#5 stuck for 61s! 
[kblockd/5:280]
Feb  4 01:30:52 kvm01 iscsid: connection1:0 is operational after recovery (1 
attempts)


Does the guest have one CPU?  How many does the host have?

After you saw the "operational after recovery" message, did the CPU use go back to normal, and did the other guests unfreeze?


The ping timeout message could indicate a temporary network problem, or it could be a bug where we were firing the error when the ping timed out even though other IO was making progress (the ping just got stuck behind other, larger IOs). That is fixed in 2.6.31.
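For what it's worth, the jiffies values in your first log line line up with that timing. A rough decode, assuming HZ=1000 on that kernel (an assumption, though the 3000-tick gaps in the log are consistent with it):

```shell
# Sketch: pull the jiffies out of the ping-timeout log line and convert
# them to seconds, assuming HZ=1000 (CONFIG_HZ is an assumption here).
line="connection1:0: ping timeout of 3 secs expired, last rx 4295304767, last ping 4295307767, now 4295310767"
result=$(echo "$line" | awk -F'[ ,]+' '{
  for (i = 1; i <= NF; i++) {
    if ($i == "rx") rx = $(i + 1)                            # jiffies of last PDU received
    if ($i == "ping" && $(i - 1) == "last") ping = $(i + 1)  # jiffies when the nop-out went out
    if ($i == "now") now = $(i + 1)                          # jiffies when the timer fired
  }
  hz = 1000
  printf "nop-out sent %ds after last rx; timed out %ds later", (ping - rx) / hz, (now - ping) / hz
}')
echo "$result"
```

So the initiator went 3 seconds with no traffic, sent a nop-out ping, and gave up 3 seconds after that, which matches a noop_out_interval and noop_out_timeout of 3 each.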

The ping timeout causes the initiator to drop the session, so you see the "detected conn error" messages and, later, the "operational after recovery" message. Between those messages, the iscsi layer tells the scsi layer not to send it any IO, since it is trying to reconnect to the target and cannot execute anything.
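You can see that blocked state from userspace while the session is in recovery; for example (output field names vary a bit between iscsi-initiator-utils versions):

```shell
# Print detailed per-session info; during recovery the session state
# shows as failed/in-recovery instead of LOGGED_IN, and the attached
# SCSI devices show as blocked rather than running.
iscsiadm -m session -P 3 | grep -i 'state'
```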

It looks like while we had the scsi layer stop IO, the soft lockup code spit out an error. I have to look at that more closely, because if that process is stuck in scsi_request_fn (maybe on a spin lock), that would explain why CPU usage goes so high while nothing happens. But we should not be stuck on a spin lock in there.
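If you can catch it while the CPU is pegged, a task dump would show where kblockd/5 is actually stuck. A sketch, assuming sysrq is built into your kernel and you can read the kernel log:

```shell
# Enable all sysrq functions for this boot (assumption: not already on).
echo 1 > /proc/sys/kernel/sysrq
# Dump every task's stack trace into the kernel log.
echo t > /proc/sysrq-trigger
# Look for the kblockd thread named in the soft lockup message
# (kblockd/5, pid 280 in your log) and post its stack trace.
dmesg | grep -A 15 'kblockd'
```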

Is this something you can easily replicate?


Also, unless you are using dm-multipath or doing some sort of clustering that requires it, a ping timeout of 3 secs is really short. I would bump it to something more like 10 or 15 seconds (maybe node.conn[0].timeo.noop_out_interval = 15 and node.conn[0].timeo.noop_out_timeout = 10).
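For the record, one way to make that change (a sketch; TARGET_IQN and PORTAL are placeholders for your own target name and portal address, and updated settings only take effect on the next login):

```shell
# Defaults for newly discovered node records go in /etc/iscsi/iscsid.conf:
#   node.conn[0].timeo.noop_out_interval = 15
#   node.conn[0].timeo.noop_out_timeout = 10

# Update an existing node record in place:
iscsiadm -m node -T TARGET_IQN -p PORTAL -o update \
    -n node.conn[0].timeo.noop_out_interval -v 15
iscsiadm -m node -T TARGET_IQN -p PORTAL -o update \
    -n node.conn[0].timeo.noop_out_timeout -v 10
# Then log the session out and back in for the new values to apply.
```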




Feb  4 01:30:57 kvm01 kernel:  connection1:0: ping timeout of 3 secs expired, 
last rx 4295413014, last ping 4295416014, now 4295419014
Feb  4 01:30:57 kvm01 kernel:  connection1:0: detected conn error (1011)
Feb  4 01:30:58 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) 
state (3)
Feb  4 01:31:23 kvm01 kernel:  connection1:0: detected conn error (1019)

This one is pretty rare. tcp_sendpage returned an error.

Feb  4 01:31:23 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1019) 
state (1)
Feb  4 01:31:27 kvm01 iscsid: connection1:0 is operational after recovery (1 
attempts)
------------------------------------------/ snip 
/--------------------------------------------------------

This is with kernel and initiator tools:

Linux kvm01.xxxxxxxx.xxx 2.6.29.1 #1 SMP Sat Apr 11 20:03:55 EDT 2009 x86_64 
x86_64 x86_64 GNU/Linux
iscsi-initiator-utils-6.2.0.871-0.10.el5

Prior to the mkfs the host was running fine with four other guests on multiple 
iSCSI LUNs.

Having checked the network switch, all looks pretty good.  I would be grateful 
for some advice on how to track down the issue.
