On 02/04/2010 06:57 AM, --[ UxBoD ]-- wrote:
Hi,

we have recently experienced a problem where running mkfs.ext4 on a freshly 
presented LUN spirals the CPU out of control and makes all the KVM guests on 
the same host freeze.

On checking /var/log/messages when it happens I see:

------------------------------------------/ snip 
/--------------------------------------------------------
Feb  4 01:29:09 kvm01 kernel:  connection1:0: ping timeout of 3 secs expired, 
last rx 4295304767, last ping 4295307767, now 4295310767
Feb  4 01:29:09 kvm01 kernel:  connection1:0: detected conn error (1011)
Feb  4 01:29:10 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) 
state (3)



Feb  4 01:30:06 kvm01 kernel: BUG: soft lockup - CPU#5 stuck for 61s! 
[kblockd/5:280]
Feb  4 01:30:52 kvm01 iscsid: connection1:0 is operational after recovery (1 
attempts)


Does the guest have one CPU?  How many does the host have?

After you saw the "operational after recovery" message, did the CPU use go back to normal, and did the other guests unfreeze?


The ping timeout message could indicate a temporary network problem, or it could be a bug where we were firing the error when the ping timed out even though other IO was making progress (the ping just got stuck behind other, larger IOs). That is fixed in 2.6.31.
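For what it's worth, the jiffies values in your first log line line up with that timing. A rough decode, assuming HZ=1000 on that kernel (an assumption, though the 3000-tick gaps in the log are consistent with it):

```shell
# Sketch: pull the jiffies out of the ping-timeout log line and convert
# them to seconds, assuming HZ=1000 (CONFIG_HZ is an assumption here).
line="connection1:0: ping timeout of 3 secs expired, last rx 4295304767, last ping 4295307767, now 4295310767"
result=$(echo "$line" | awk -F'[ ,]+' '{
  for (i = 1; i <= NF; i++) {
    if ($i == "rx") rx = $(i + 1)                            # jiffies of last PDU received
    if ($i == "ping" && $(i - 1) == "last") ping = $(i + 1)  # jiffies when the nop-out went out
    if ($i == "now") now = $(i + 1)                          # jiffies when the timer fired
  }
  hz = 1000
  printf "nop-out sent %ds after last rx; timed out %ds later", (ping - rx) / hz, (now - ping) / hz
}')
echo "$result"
```

So the initiator went 3 seconds with no traffic, sent a nop-out ping, and gave up 3 seconds after that, which matches a noop_out_interval and noop_out_timeout of 3 each.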

The ping timeout causes the initiator to drop the session, so you see the "detected conn error" messages and, later, the "operational after recovery" message. Between those messages, the iscsi layer tells the scsi layer not to send it any IO, since it is trying to reconnect to the target and cannot execute anything.
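You can see that blocked state from userspace while the session is in recovery; for example (output field names vary a bit between iscsi-initiator-utils versions):

```shell
# Print detailed per-session info; during recovery the session state
# shows as failed/in-recovery instead of LOGGED_IN, and the attached
# SCSI devices show as blocked rather than running.
iscsiadm -m session -P 3 | grep -i 'state'
```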

It looks like while we had the scsi layer stop IO, the soft lockup code spit out an error. I have to look at that more closely, because if that process is stuck in scsi_request_fn (maybe on a spin lock), that would explain why CPU usage goes so high while nothing happens. But we should not be stuck on a spin lock in there.
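If you can catch it while the CPU is pegged, a task dump would show where kblockd/5 is actually stuck. A sketch, assuming sysrq is built into your kernel and you can read the kernel log:

```shell
# Enable all sysrq functions for this boot (assumption: not already on).
echo 1 > /proc/sys/kernel/sysrq
# Dump every task's stack trace into the kernel log.
echo t > /proc/sysrq-trigger
# Look for the kblockd thread named in the soft lockup message
# (kblockd/5, pid 280 in your log) and post its stack trace.
dmesg | grep -A 15 'kblockd'
```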

Is this something you can easily replicate?


Also, unless you are using dm-multipath or doing some sort of clustering that requires it, a ping timeout of 3 secs is really short. I would bump it to something more like 10 or 15 seconds (maybe node.conn[0].timeo.noop_out_interval = 15 and node.conn[0].timeo.noop_out_timeout = 10).
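For the record, one way to make that change (a sketch; TARGET_IQN and PORTAL are placeholders for your own target name and portal address, and updated settings only take effect on the next login):

```shell
# Defaults for newly discovered node records go in /etc/iscsi/iscsid.conf:
#   node.conn[0].timeo.noop_out_interval = 15
#   node.conn[0].timeo.noop_out_timeout = 10

# Update an existing node record in place:
iscsiadm -m node -T TARGET_IQN -p PORTAL -o update \
    -n node.conn[0].timeo.noop_out_interval -v 15
iscsiadm -m node -T TARGET_IQN -p PORTAL -o update \
    -n node.conn[0].timeo.noop_out_timeout -v 10
# Then log the session out and back in for the new values to apply.
```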




Feb  4 01:30:57 kvm01 kernel:  connection1:0: ping timeout of 3 secs expired, 
last rx 4295413014, last ping 4295416014, now 4295419014
Feb  4 01:30:57 kvm01 kernel:  connection1:0: detected conn error (1011)
Feb  4 01:30:58 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) 
state (3)
Feb  4 01:31:23 kvm01 kernel:  connection1:0: detected conn error (1019)

This one is pretty rare. tcp_sendpage returned an error.

Feb  4 01:31:23 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1019) 
state (1)
Feb  4 01:31:27 kvm01 iscsid: connection1:0 is operational after recovery (1 
attempts)
------------------------------------------/ snip 
/--------------------------------------------------------

This is with kernel and initiator tools:

Linux kvm01.xxxxxxxx.xxx 2.6.29.1 #1 SMP Sat Apr 11 20:03:55 EDT 2009 x86_64 
x86_64 x86_64 GNU/Linux
iscsi-initiator-utils-6.2.0.871-0.10.el5

Prior to the mkfs the host was running fine with four other guests on multiple 
iSCSI LUNs.

Having checked the network switch, all looks pretty good.  I would be grateful 
for some advice on how to track down the issue.
