Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped

2011-08-04 Thread Massimiliano Ferrero

On 08/07/2011 09:50, Ritesh Raj Sarraf wrote:

Any update on this issue? I am lowering its severity to Important as it
is a corner case.


Hello, sorry for not having updated you earlier

I think the issue has nothing to do with an ietd bug.
I think that the server log, even with debug active, just showed that
ietd was closing the connections, but the problem originates from the
initiators: there are timeouts on the connections and they are dropped.


I had designed the architecture to boot the initiators from USB keys
(Debian over a JFFS2 file system) and had the very bad idea of putting
swap on one of the iSCSI LVM volumes (one volume for each cluster node).
I think this was one source of problems: any problem on the swap volume
would result in processes crashing on the node.


Now we have bought disks for the nodes and the system is installed on 
the disks, and I see no more errors on the swap space.

I think the bug can be closed, thank you for the support.


We still have some problems with dropped connections; probably I have not
set the iSCSI initiator timeouts correctly.
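For reference, the relevant initiator-side timeouts live in open-iscsi's
/etc/iscsi/iscsid.conf. The values below are only illustrative, not
recommended settings; the "ping timeout of 5 secs expired" messages in the
syslog below appear to come from the NOP-Out settings:

```
# /etc/iscsi/iscsid.conf (initiator side) -- example values only
# How often a NOP-Out ping is sent to the target, and how long to wait
# for the reply before declaring the connection bad:
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
# How long a session waits for a reconnect before failing I/O upward:
node.session.timeo.replacement_timeout = 120
```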

This is what I see in the initiator syslog:

/var/log/syslog.5.gz:Jul 30 14:40:27 xen002 qdiskd[3187]: qdisk cycle 
took more than 3 seconds to complete (3.47)
/var/log/syslog.5.gz:Jul 30 14:44:39 xen002 kernel: [  537.074759] dlm: 
closing connection to node 2
/var/log/syslog.5.gz:Jul 30 14:44:39 xen002 kernel: [  537.074890] dlm: 
closing connection to node 1
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401455]  
connection12:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401494]  
connection7:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401526]  
connection11:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401639]  
connection6:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401690]  
connection10:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401722]  
connection5:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401750]  
connection3:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401773]  
connection9:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401809]  
connection1:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401850]  
connection8:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.401879]  
connection2:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:44:43 xen002 kernel: [  541.402096]  
connection4:0: detected conn error (1020)
/var/log/syslog.5.gz:Jul 30 14:47:59 xen002 nrpe[8427]: Listening for 
connections on port 5666
/var/log/syslog.5.gz:Jul 30 14:47:59 xen002 nrpe[8427]: Allowing 
connections from: 10.212.0.1,10.213.0.1
/var/log/syslog.5.gz:Jul 30 15:26:31 xen002 qdiskd[3241]: qdisk cycle 
took more than 3 seconds to complete (3.26)
/var/log/syslog.5.gz:Jul 30 15:28:53 xen002 qdiskd[3241]: qdisk cycle 
took more than 3 seconds to complete (3.68)
/var/log/syslog.5.gz:Jul 30 17:52:14 xen002 qdiskd[3241]: qdisk cycle 
took more than 3 seconds to complete (3.01)
/var/log/syslog.5.gz:Jul 30 22:23:45 xen002 qdiskd[3241]: qdisk cycle 
took more than 3 seconds to complete (3.05)
/var/log/syslog.5.gz:Jul 30 23:03:57 xen002 qdiskd[3241]: Assuming 
master role
/var/log/syslog.5.gz:Jul 30 23:04:00 xen002 qdiskd[3241]: Writing 
eviction notice for node 1

/var/log/syslog.5.gz:Jul 30 23:04:03 xen002 qdiskd[3241]: Node 1 evicted
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308142]  
connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 
4302371373, last ping 4302372623, now 4302373873
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308300]  
connection5:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308304]  
connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 
4302371373, last ping 4302372623, now 4302373873
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308407]  
connection4:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308410]  
connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 
4302371373, last ping 4302372623, now 4302373873
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.308518]  
connection1:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.316053]  
connection12:0: ping timeout of 5 secs expired, recv timeout 5, last rx 
4302371374, last ping 4302372624, now 4302373875
/var/log/syslog.5.gz:Jul 30 23:04:58 xen002 kernel: [29926.316173]  
connection12:0: detected conn error (1011)
/var/log/syslog.5.gz:Jul 30 23:04:59 xen002 iscsid: Kernel reported 
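When triaging initiator logs like the excerpt above, tallying the
open-iscsi conn error codes per connection shows whether the drops cluster
on one session or hit all sessions at once. A minimal sketch, with a few
sample lines inlined for illustration (feed it the real syslog in
practice):

```python
import re
from collections import Counter

# A few sample lines in the same shape as the syslog excerpt above.
SAMPLE = """\
Jul 30 14:44:43 xen002 kernel: [  541.401455] connection12:0: detected conn error (1020)
Jul 30 14:44:43 xen002 kernel: [  541.401494] connection7:0: detected conn error (1020)
Jul 30 23:04:58 xen002 kernel: [29926.308300] connection5:0: detected conn error (1011)
"""

PATTERN = re.compile(r"(connection\d+:\d+): detected conn error \((\d+)\)")

def tally(log_text):
    """Count conn errors by error code and by connection name."""
    by_code, by_conn = Counter(), Counter()
    for conn, code in PATTERN.findall(log_text):
        by_conn[conn] += 1
        by_code[code] += 1
    return by_code, by_conn

by_code, by_conn = tally(SAMPLE)
print(dict(by_code))  # {'1020': 2, '1011': 1}
```

If every connection fails in the same second (as above), the cause is more
likely on the target or network side than in any single initiator session.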

Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped

2011-07-08 Thread Ritesh Raj Sarraf
severity 629442 important
thanks

On 06/14/2011 07:21 PM, Ritesh Raj Sarraf wrote:
 Yes, that's my suspicion as well. The per-bdi writeback mechanism
 improves this situation to a great extent, but I'm not sure if it is
 part of the Squeeze kernel.
 http://lwn.net/Articles/326552/

Any update on this issue? I am lowering its severity to Important as it
is a corner case.

-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System






Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped

2011-06-14 Thread Ritesh Raj Sarraf
On 06/14/2011 05:12 PM, Massimiliano Ferrero wrote:
 
 Next time, when you try to test/re-create the bug, capture dstat output.
 The default dstat output is good enough to tell us what the system state
 was during starvation.
 Hello, yesterday and tonight I performed some other tests; these are the
 results:
 
 1) It seems I am not able to reproduce the bug on a test system.
 The test system (san01) has the same processor (E5220) and amount of RAM
 (12 GB), but a smaller I/O subsystem: an 8-channel 3ware controller with
 an 8-disk RAID 5 array.
 The system that presents the problem (san00) has a 24-channel controller
 and a 23-disk RAID 6 array (+ 1 hot spare).
 Both systems are connected through the same gigabit switches.
 

Is the OS/kernel also the same?

 There is another hw difference between the two environments: the nodes
 connected to san00 are high-end hw; their network cards are able to
 generate nearly 1 Gb/s of iSCSI traffic.
 The nodes connected to san01 are low-end hw, and their network cards do
 not exceed 300 Mb/s.
 So the system that presents the problem has both a higher-performance
 I/O subsystem and attached machines able to generate more than 3 times
 the I/O operations.


Unfortunately, dstat's default output doesn't capture VM statistics. Do
you have any idea what the VM consumption was when you saw the problem?

Assuming it is buffered I/O, your VM will soon be consumed (I see from
the dstat logs that the CPU was in I/O wait for a very long time) and
then the system will start paging. And this is the scenario where Linux
still tends to succumb.


On your faulty setup, can you try unbuffered direct I/O and see if
that can trigger the problem? I have low hopes that it will fail. To
trigger (direct) I/O, you can use the fio tool. It is already packaged
in Debian.
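A direct-I/O run with fio might look like the following; the filename is
a placeholder for a scratch file or volume on the faulty setup, assumed
here purely for illustration:

```
# write 4 GB with O_DIRECT, bypassing the page cache
fio --name=direct-write --filename=/path/to/scratch/file \
    --direct=1 --rw=write --bs=1M --size=4g
```

If the connection drops only without --direct=1, that points at the
page-cache flush behaviour rather than the raw disk throughput.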


 At the moment I am not able to tell which of these aspects, or the sum
 of them, creates the condition for the problem: I suspect that it's a
 mix of all of them.
 Unfortunately, at the moment I do not have hw similar to the one in
 production to perform a test in the same conditions.
 
 2) san00 presents the problem even with the deadline scheduler active on
 all logical volumes exported through iSCSI or used by the heavy-load
 operation (dd).
 
 3) On san00 I was able to reproduce the problem in a simpler condition
 than the one I described in the first mail: just one node connected
 through iSCSI (the other node was restarting), no virtual machines
 running on the node, and the node performing one I/O-intensive operation
 on one of the LVs exported by iSCSI/LVM (an fsck on one file system).
 During this operation I launched a dd on san00, and the iSCSI connection
 was dropped after a few seconds.
 

I think it is the typical Linux I/O controller problem, which I believe
is a combination of the I/O scheduler + VM subsystem.

 I am attaching 3 files: dstat output during the test and extracts of
 /var/log/messages and /var/log/syslog.
 I have just filtered out information for non-relevant services (nagios,
 dhcp, snmp, postfix, etc.), both for readability and confidentiality.
 ietd was running with the following command line:
 /usr/sbin/ietd --debug=255
 so in the log we have debug information.
 The problem can be seen in syslog at Jun 14 01:28:53.
 At Jun 14 01:34:06 I turned off the node for reboot, and in the log
 there are some records regarding termination of iSCSI sessions.
 I do not see anything relevant in the ietd debug log, just a restart of
 the connections.
 
 In the dstat output the dd operation was started around line 197 and was
 terminated at line 208 (I interrupted the operation as soon as I saw the
 problem).
 
 What I see in the dstat output is the following: for some seconds (about
 10) dd does not generate a lot of reads and writes:
 
 usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
   7   2  56  35   0   0|  12M   14M|  22k   35k|   0 0 |441512k
 
 12M read and 14M write, and this could be from the dd operation or the
 fsck performed through iSCSI.
 
 Then there is a burst of writes, I guess using the full I/O capacity of
 the controller and of the disks:
 
 usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 35   7  35  22   0   1|8180k  325M|  38k   25k|   0 0 |686011k
   2   3  59  36   0   1|3072B  541M|  20k   26k|   0 0 |5380  2747
   3   4  64  30   0   0|5120B  473M|  21k   30k|   0 0 |475216k
 
 Writes of 325M, 541M, 473M:
 this is exactly the moment when the problem arises.
 
 Could it be that the I/O operations are cached in memory and the problem
 presents itself when they are flushed to disk?
 

Yes, that's my suspicion as well. The per-bdi writeback mechanism
improves this situation to a great extent, but I'm not sure if it is
part of the Squeeze kernel.
http://lwn.net/Articles/326552/
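One knob related to this flush-burst behaviour is the kernel's
dirty-page thresholds. Purely as an illustration of the mechanism (these
values are not suggested anywhere in this thread), lowering them makes
background writeback start earlier, so less dirty data accumulates
before a flush:

```
# /etc/sysctl.conf -- example values only
# start background writeback once 5% of memory is dirty
vm.dirty_background_ratio = 5
# throttle writers once 10% of memory is dirty
vm.dirty_ratio = 10
```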

 
 If no pointer to a potential solution comes out of the logs,
 the only other test I can think of is upgrading to a newer kernel, but I
You can try this just to ascertain the 

Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped

2011-06-11 Thread Ritesh Raj Sarraf
What is the version of the kernel where you are seeing this problem?
In my experience, older kernels weren't good at handling huge
workloads. In case this is a starvation issue (which I suspect), there
should be some kernel messages spat out. Can you check that?

Also, if possible, can you reproduce the issue with ietd's debug mode
enabled?


Ritesh


On 06/07/2011 12:04 AM, Massimiliano Ferrero wrote:
 Package: iscsitarget
 Version: 1.4.20.2-1
 Severity: grave
 Justification: renders package unusable


 We have a system configured as follows:
 1 CPU Xeon E5520 quad core
 12 GB RAM
 24 SATA 1TB disks connected to a 3ware 9650-24M controller

 The system (I will call it the storage server) is also used as an iSCSI
 target for a two-node Red Hat cluster
 (implemented with Debian machines): the storage server exports an iSCSI
 LUN that contains a clustered LVM volume group.
 The two nodes mount the iSCSI target, then use clvmd of Red Hat Cluster
 to manage concurrent access to the logical
 volumes on the clustered volume group.
 The nodes are used for virtualization with Xen, with the virtual machines
 stored in the shared volume group; this allows
 for live migration of the virtual machines from one node to the other.
 Each node has 32 cores and 32 GB RAM.

 All three machines (the storage server and the two nodes) are
 Debian-based:
 the iSCSI target server was upgraded from lenny to squeeze one week ago;
 the two nodes are still on Debian lenny.

 Tonight and today we experienced two ietd hang-ups: while we were
 doing two I/O-intensive operations on the storage
 server, we got the following errors:

 FIRST OCCURRENCE:

 Jun  5 23:19:19 san00 kernel: [334912.274409] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
 Jun  5 23:19:20 san00 kernel: [334912.634797] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:1407374921105920 (Function Complete)
 Jun  5 23:19:28 san00 kernel: [334920.660528] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
 Jun  5 23:19:28 san00 kernel: [334920.660811] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
 Jun  5 23:19:28 san00 kernel: [334920.661105] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
 Jun  5 23:19:28 san00 kernel: [334920.661357] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:1407374921105920 (Function Complete)
 Jun  5 23:19:28 san00 kernel: [334920.661570] iscsi_trgt: Logical Unit Reset 
 (05) issued on tid:2 lun:0 by sid:562949990973952 (Function Complete)
 Jun  5 23:19:58 san00 kernel: [334950.930122] iscsi_trgt: Abort Task (01) 
 issued on tid:3 lun:0 by sid:844424984461824 (Function Complete)
 Jun  5 23:19:59 san00 kernel: [334951.602686] iscsi_trgt: Abort Task (01) 
 issued on tid:3 lun:0 by sid:1688849914593792 (Function Complete)
 ...
 Jun  5 23:20:31 san00 kernel: [334983.635928] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
 Jun  5 23:20:41 san00 kernel: [334993.616383] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
 ...
 Jun  5 23:20:51 san00 kernel: [335003.596884] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
 Jun  5 23:21:01 san00 kernel: [335013.577396] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
 Jun  5 23:21:01 san00 kernel: [335013.577645] iscsi_trgt: Logical Unit Reset 
 (05) issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
 Jun  5 23:21:11 san00 kernel: [335023.557901] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:1970324874527232 (Function Complete)
 Jun  5 23:21:13 san00 kernel: [335025.386242] iscsi_trgt: Abort Task (01) 
 issued on tid:3 lun:0 by sid:2533274844725760 (Function Complete)
 ...
 Jun  5 23:24:21 san00 kernel: [335213.072102] iscsi_trgt: Abort Task (01) 
 issued on tid:3 lun:0 by sid:2814749821436416 (Function Complete)
 Jun  5 23:24:21 san00 kernel: [335213.075205] iscsi_trgt: Abort Task (01) 
 issued on tid:3 lun:0 by sid:3659174751568384 (Function Complete)
 Jun  5 23:24:32 san00 kernel: [335224.457895] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
 Jun  5 23:24:42 san00 kernel: [335234.437754] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
 Jun  5 23:24:42 san00 kernel: [335234.437873] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
 Jun  5 23:24:52 san00 kernel: [335244.418062] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
 Jun  5 23:24:52 san00 kernel: [335244.418174] iscsi_trgt: Abort Task (01) 
 issued on tid:2 lun:0 by sid:3096224781369856 (Function Complete)
 Jun  5 23:24:54 

Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped

2011-06-11 Thread Massimiliano Ferrero

On 11/06/2011 09:32, Ritesh Raj Sarraf wrote:

What is the version of the kernel where you are seeing this problem?
In my experience, older kernels weren't good at handling huge
workloads. In case this is a starvation issue (which I suspect), there
should be some kernel messages spat out. Can you check that?

Also, if possible, can you reproduce the issue with ietd's debug mode
enabled?


Sorry, I thought I had indicated the kernel version: it is 2.6.32-5-amd64

Linux san00 2.6.32-5-amd64 #1 SMP Wed May 18 23:13:22 UTC 2011 x86_64 
GNU/Linux


dpkg -l 'linux-image*'|grep ii
ii  linux-image-2.6.32-5-amd64  
2.6.32-34squeeze1 Linux 2.6.32 for 64-bit PCs


Debian package version 2.6.32-34squeeze1.
That should be the latest squeeze kernel with security updates.

There are no other kernel messages in the problem timeframe; I have
already cut and pasted all the relevant messages.
I can try to reproduce the problem on a test machine, and I will try that
with debug enabled as well. I will do more tests on the production system
only if I am not able to reproduce the problem on the test system.
Another test I want to perform is enabling the deadline scheduler for the
logical volumes exported through iSCSI; do you think this could
influence the problem?


Thank you for the support
Massimiliano

--

Massimiliano Ferrero
Midhgard s.r.l.
C/so Svizzera 185 bis
c/o centro Piero della Francesca
10149 - Torino
tel. +39-0117575375
fax  +39-0117768576
e-mail: m.ferr...@midhgard.it
sito web: http://www.midhgard.it







Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped

2011-06-11 Thread Ritesh Raj Sarraf
On 06/11/2011 01:37 PM, Massimiliano Ferrero wrote:
 That should be the latest squeeze kernel with security updates.
 
 There are no other kernel messages in the problem timeframe; I have
 already cut and pasted all the relevant messages.
 I can try to reproduce the problem on a test machine, and I will try
 that with debug enabled as well. I will do more tests on the production
 system only if I am not able to reproduce the problem on the test system.
 Another test I want to perform is enabling the deadline scheduler for
 the logical volumes exported through iSCSI; do you think this could
 influence the problem?

It should be worth a try. Try deadline and see if that helps.
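The scheduler switch can be done per block device at runtime via sysfs; a
sketch, where sda stands in for whichever device backs the exported LVs:

```
# show the available schedulers (the active one is bracketed)
cat /sys/block/sda/queue/scheduler

# switch this device to deadline (not persistent across reboots;
# boot with elevator=deadline to make it the system default)
echo deadline > /sys/block/sda/queue/scheduler
```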


Lately, CFQ has had a lot of improvements, but those might not have
made it into the squeeze kernel.


BTW, does this issue happen only on heavy load? Or is it triggered also
on moderate load? I'm sure that when this issue occurs, the machine is
completely unresponsive.



-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System





Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped

2011-06-11 Thread Massimiliano Ferrero

On 11/06/2011 10:41, Ritesh Raj Sarraf wrote:

It should be worth a try. Try deadline and see if that helps.


Lately, CFQ has had a lot of improvements, but those might not have
made it into the squeeze kernel.


BTW, does this issue happen only on heavy load? Or is it triggered also
on moderate load? I'm sure that when this issue occurs, the machine is
completely unresponsive.


No, it happens only on heavy load. The storage server also performs
several functions other than iSCSI target; one of these is monitoring
all the virtual machines with nagios and munin: munin updates run every
5 minutes with a lot of data written to disk, and this has never been
a problem.
The only situation that triggers the problem is performing an operation
like dd or a file system check that is really I/O-intensive on the
storage server.
The second time I experienced the problem I was connected via ssh to
the system during the dd, watching the log files: I saw the abort
messages in real time and was able to stop the dd and react to the
failure (restart cluster nodes, check virtual machines, etc.). This is
just to say that it was not necessary to restart the iSCSI target and
that the system was not unresponsive.


I will update the bug as soon as I can test it on another machine.

Best regards
Massimiliano








Bug#629442: [Pkg-iscsi-maintainers] Bug#629442: iscsitarget: ietd gives iscsi_trgt: Abort Task errors on high disk load and iscsi connections are dropped

2011-06-11 Thread Ritesh Raj Sarraf
On 06/11/2011 05:45 PM, Massimiliano Ferrero wrote:

 No, it happens only on heavy load. The storage server also performs
 several functions other than iSCSI target; one of these is monitoring
 all the virtual machines with nagios and munin: munin updates run every
 5 minutes with a lot of data written to disk, and this has never been
 a problem.
 The only situation that triggers the problem is performing an operation
 like dd or a file system check that is really I/O-intensive on the
 storage server.
 The second time I experienced the problem I was connected via ssh to
 the system during the dd, watching the log files: I saw the abort
 messages in real time and was able to stop the dd and react to the
 failure (restart cluster nodes, check virtual machines, etc.). This is
 just to say that it was not necessary to restart the iSCSI target and
 that the system was not unresponsive.
 
 I will update the bug as soon as I can test it on another machine.

I'm afraid this won't be an easy bug to root-cause. But let's give it a
shot.

Next time, when you try to test/re-create the bug, capture dstat output.
The default dstat output is good enough to tell us what the system state
was during starvation.
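Capturing that could be as simple as the following; the CSV path is just
an example:

```
# default columns at a 1-second interval, also logged to CSV for later review
dstat --output /tmp/dstat.csv 1
```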


-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System


