Hi,

Changed nothing to my knowledge :p

These boxes don't currently have fencing enabled.  I imagine the reboot is 
caused by a kernel panic; sysctl is set to reboot on panic.
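
For anyone wanting to confirm the panic-reboot setting on their own box, here is a minimal sketch. It assumes a Linux host that exposes the kernel.panic sysctl at /proc/sys/kernel/panic; the helper name describe_panic_setting is made up for illustration, and the negative-value case follows current kernel documentation, so it may not apply to an older kernel.

```python
from pathlib import Path

# /proc file backing the kernel.panic sysctl (assumed to exist on a Linux host).
PANIC_SYSCTL = Path("/proc/sys/kernel/panic")


def describe_panic_setting(raw: str) -> str:
    """Interpret a kernel.panic value (hypothetical helper, illustration only)."""
    seconds = int(raw.strip())
    if seconds > 0:
        # Positive: the kernel reboots this many seconds after a panic.
        return f"reboot {seconds}s after a panic"
    if seconds < 0:
        # Negative: documented for newer kernels as "reboot immediately".
        return "reboot immediately after a panic"
    # Zero: the kernel stays halted in the panic state.
    return "no automatic reboot on panic"


if __name__ == "__main__":
    if PANIC_SYSCTL.exists():
        print(describe_panic_setting(PANIC_SYSCTL.read_text()))
    else:
        print("kernel.panic is not exposed on this system")
```

A positive value here would be consistent with both nodes rebooting by themselves after a panic, but it does not prove that a panic is what happened.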

There is one big 4TB LUN, used by several VMs on XenServer, each with multiple 
disks.

In my quest to resolve this, I have changed IET to use fileio instead of 
blockio and tweaked some DRBD performance-related settings 
(http://www.drbd.org/users-guide/s-latency-tuning.html).

If I'm woken up again tonight with this thing breaking it's going in the bin.  
I'll probably also ditch ietd and try open-iscsi or iscsi-scst.  Monday morning 
I'll be shifting some load off this cluster also.

Regards,

James

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Andreas Kurz
Sent: 29 October 2011 22:36
To: [email protected]
Subject: Re: [Linux-HA] SANs falling over, don't know why!

Hello,

On 10/29/2011 08:47 PM, James Smith wrote:
> Hi,
> 
> All of a sudden, a SAN pair which was running without any problems for six 
> months, now decides to fall over every couple of hours.

So what did you change? ;-)

> 
> The logs I have to go on are below:
> 
> Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
> Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete)
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete)
> Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete)
> Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
> Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
> Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512

Concurrent local writes ... Is there any kind of cluster software using a 
shared quorum disk or something like that on this LUN? Or is this LUN shared 
between several VMware ESX VMs?
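
To see which sectors are involved, the DRBD entries can be pulled out of the syslog text with a small script. This is only a rough sketch: the regex is modelled on the message format in the quoted log, the function names are invented for illustration, and it tolerates the line wrapping and "> " quoting that mail clients add.

```python
import re
from collections import Counter

# Modelled on the quoted log, e.g.
# "Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584"
PATTERN = re.compile(
    r"Concurrent local write detected!\s+\[DISCARD L\]\s+"
    r"new:\s+(\d+)s\s+\+(\d+);\s+pending:\s+(\d+)s\s+\+(\d+)"
)


def concurrent_writes(log_text: str):
    """Return (new_sector, new_size, pending_sector, pending_size) tuples."""
    # Drop mail quoting ("> ") and rejoin wrapped lines into one long string.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    return [tuple(int(g) for g in m.groups()) for m in PATTERN.finditer(flat)]


def size_histogram(entries):
    """Count how often each 'new' write size appears."""
    return Counter(new_size for _, new_size, _, _ in entries)


if __name__ == "__main__":
    import sys

    found = concurrent_writes(sys.stdin.read())
    for entry in found:
        print(entry)
    print(dict(size_histogram(found)))
```

Fed a saved copy of the log on stdin, it prints one tuple per entry. In every entry quoted above the new and pending values are identical, which the tuples make easy to check.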

> Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> 
> After this event, both members of the SAN pair reboot.  It is very 
> disruptive, as it's killing the VMs using this SAN, requiring fscks after 
> failure.  The load on the SAN doesn't need to be very high for this to happen.
> 

They reboot because of a kernel panic, or because of some fencing mechanism?

> Running the following:
> 
> CentOS 5 with kernel 2.6.18-274.7.1.el5
> IET 1.4.20.2
> Pacemaker 1.0.11-1.2.el5
> DRBD 8.3.11

Would be interesting to see Pacemaker/DRBD/IET config ....

Regards,
Andreas
--
Need help with Pacemaker?
http://www.hastexo.com/now


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
