Hi,

All of a sudden, a SAN pair which had been running without any problems for six months has started falling over every couple of hours.
The logs I have to go on are below:

Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete)
Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete)
Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete)
Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512
Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24

After this event, both members of the SAN pair reboot. It is very disruptive, as it kills the VMs using this SAN, which then need fscks after the failure. The load on the SAN doesn't need to be very high for this to happen.
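From what I've read, the "Concurrent local write detected" messages show up when both sides end up writing the same sectors, which can happen with dual-primary DRBD. For reference, a dual-primary resource would have settings along these lines in drbd.conf (the resource name r0 is a placeholder, not necessarily our actual config):

```
resource r0 {
  net {
    allow-two-primaries;
  }
  startup {
    become-primary-on both;
  }
}
```

`drbdadm dump <resource>` shows the effective settings on each node.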
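To see whether the aborts are concentrated on one initiator session or spread across all of them, I've been counting the "Abort Task" lines per sid. A quick sketch (here against an inline sample; the real input would be /var/log/messages):

```shell
# Extract the sid from each "Abort Task" line and count occurrences per sid.
# Sample data pasted inline for illustration.
cat > /tmp/abort_sample.log <<'EOF'
Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete)
Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
EOF

grep 'Abort Task' /tmp/abort_sample.log \
  | sed 's/.*by sid:\([0-9]*\).*/\1/' \
  | sort | uniq -c | sort -rn
```

On the real logs every session seems to be affected, not just one.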
Running the following:

CentOS 5 with kernel 2.6.18-274.7.1.el5
IET 1.4.20.2
Pacemaker 1.0.11-1.2.el5
DRBD 8.3.11

Googling appears to reveal many possible reasons for these Abort Tasks; any help appreciated :(

Regards,
James

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
