On 02/04/2010 06:57 AM, --[ UxBoD ]-- wrote:
Hi,
we have recently experienced a problem where running mkfs.ext4 on a freshly
presented LUN spirals the CPU out of control and makes all the KVM guests on
the same host freeze.
On checking /var/log/messages when it happens I see:
------------------------------------------/ snip
/--------------------------------------------------------
Feb 4 01:29:09 kvm01 kernel: connection1:0: ping timeout of 3 secs expired,
last rx 4295304767, last ping 4295307767, now 4295310767
Feb 4 01:29:09 kvm01 kernel: connection1:0: detected conn error (1011)
Feb 4 01:29:10 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011)
state (3)
Feb 4 01:30:06 kvm01 kernel: BUG: soft lockup - CPU#5 stuck for 61s!
[kblockd/5:280]
Feb 4 01:30:52 kvm01 iscsid: connection1:0 is operational after recovery (1
attempts)
Does the guest have one CPU? How many does the host have?
After you saw the "operational after recovery" message, did the CPU usage go
back to normal, and did the other guests unfreeze?
The ping timeout message could indicate a temporary network problem, or
it could be a bug where we were firing the error when the ping timed out
even though other IO was making progress (the ping just got stuck behind
other, larger IOs). That is fixed in 2.6.31.
The ping timeout would cause the initiator to drop the session and so
you see the detected conn error messages and later you see the
operational after recovery message. Between those messages what happens
is that the iscsi layer will tell the scsi layer to not send it any IO
since it is trying to reconnect to the target and cannot execute anything.
It looks like we had the scsi layer stop IO, and the soft lockup code spat
out an error. I have to look at that more closely, because if that
process is stuck in scsi_request_fn (maybe on a spin lock), then that
would explain why we see CPU usage go so high while nothing is happening.
But we should not be stuck on a spin lock in there.
Is this something you can easily replicate?
Also, unless you are using dm-multipath or doing some sort of clustering
that requires it, a ping timeout of 3 secs is really short. I would bump
it to more like 15 or 10 (maybe node.conn[0].timeo.noop_out_interval
should be 15 and node.conn[0].timeo.noop_out_timeout 10).
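For reference, those settings live in /etc/iscsi/iscsid.conf (which only affects node records created by future discoveries) and can be changed on existing node records with iscsiadm. The target name and portal below are placeholders; substitute your own:

```shell
# In /etc/iscsi/iscsid.conf, for records created after this change:
#   node.conn[0].timeo.noop_out_interval = 15
#   node.conn[0].timeo.noop_out_timeout = 10

# Update an existing node record in place (placeholder target/portal):
iscsiadm -m node -T iqn.2001-04.com.example:storage.lun1 -p 192.168.0.10:3260 \
    -o update -n node.conn[0].timeo.noop_out_interval -v 15
iscsiadm -m node -T iqn.2001-04.com.example:storage.lun1 -p 192.168.0.10:3260 \
    -o update -n node.conn[0].timeo.noop_out_timeout -v 10

# Log the session out and back in for the new timeouts to take effect.
```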
Feb 4 01:30:57 kvm01 kernel: connection1:0: ping timeout of 3 secs expired,
last rx 4295413014, last ping 4295416014, now 4295419014
Feb 4 01:30:57 kvm01 kernel: connection1:0: detected conn error (1011)
Feb 4 01:30:58 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1011)
state (3)
Feb 4 01:31:23 kvm01 kernel: connection1:0: detected conn error (1019)
This one is pretty rare. tcp_sendpage returned an error.
Feb 4 01:31:23 kvm01 iscsid: Kernel reported iSCSI connection 1:0 error (1019)
state (1)
Feb 4 01:31:27 kvm01 iscsid: connection1:0 is operational after recovery (1
attempts)
------------------------------------------/ snip
/--------------------------------------------------------
This is with kernel and initiator tools:
Linux kvm01.xxxxxxxx.xxx 2.6.29.1 #1 SMP Sat Apr 11 20:03:55 EDT 2009 x86_64
x86_64 x86_64 GNU/Linux
iscsi-initiator-utils-6.2.0.871-0.10.el5
Prior to the mkfs, the host was running fine with four other guests on multiple
iSCSI LUNs.
Having checked the network switch, all looks pretty good. Would be grateful for
some advice on how to track down the issue.
--
You received this message because you are subscribed to the Google Groups
"open-iscsi" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/open-iscsi?hl=en.