Hi, Changed nothing to my knowledge :p
These boxes don't currently have fencing enabled. I imagine the reboot is caused by a kernel panic; sysctl is set to reboot on panic.

There is one big 4TB LUN, used by several VMs on XenServer, each with multiple disks.

In my quest to resolve this, I have changed IET to use fileio instead of blockio and fiddled with some DRBD performance-related settings (http://www.drbd.org/users-guide/s-latency-tuning.html).

If I'm woken up again tonight with this thing breaking, it's going in the bin. I'll probably also ditch ietd and try open-iscsi or iscsi-scst. On Monday morning I'll also be shifting some load off this cluster.

Regards,
James

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Andreas Kurz
Sent: 29 October 2011 22:36
To: [email protected]
Subject: Re: [Linux-HA] SANs falling over, don't know why!

Hello,

On 10/29/2011 08:47 PM, James Smith wrote:
> Hi,
>
> All of a sudden, a SAN pair which was running without any problems for six
> months, now decides to fall over every couple of hours.

So what did you change?
;-)

>
> The logs I have to go on are below:
>
> Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
> Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete)
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete)
> Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete)
> Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
> Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
> Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
> Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
> Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512

Concurrent local writes .... Is there any kind of cluster software using a shared quorum disk or sthg. like that using this lun? Or this lun shared between several VMWare ESX VMs?

> Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>
> After this event, both members of the SAN pair reboot. It is very
> disruptive, as it's killing the VMs using this SAN, requiring fsck's after
> failure.
> The load on the SAN doesn't need to be very high for this to happen.
>

They reboot because of a kernel panic, or because of some fencing mechanism?

> Running the following:
>
> CentOS 5 with kernel 2.6.18-274.7.1.el5
> IET 1.4.20.2
> Pacemaker 1.0.11-1.2.el5
> DRBD 8.3.11

Would be interesting to see Pacemaker/DRBD/IET config ....

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
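
For anyone wanting to tally the events in a log excerpt like the one quoted in this thread, here is a minimal Python sketch. It is not part of the original exchange: the function name `summarize` and both regular expressions are assumptions modeled only on the log lines shown above, so they may need adjusting for other IET or DRBD versions. It counts iSCSI "Abort Task" messages per session ID and collects the sector/size pairs from DRBD's "Concurrent local write detected!" messages.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this thread (an assumption;
# other IET/DRBD versions may word these messages differently).
ABORT_RE = re.compile(
    r"iscsi_trgt: Abort Task \(\d+\) issued on tid:(\d+) lun:(\d+) by sid:(\d+)"
)
DRBD_RE = re.compile(
    r"Concurrent local write detected! \[DISCARD L\] "
    r"new: (\d+)s \+(\d+); pending: (\d+)s \+(\d+)"
)


def summarize(lines):
    """Return (aborts_per_sid, concurrent_writes) for an iterable of log lines.

    aborts_per_sid is a Counter keyed by session ID (as a string).
    concurrent_writes is a list of
    (new_sector, new_size, pending_sector, pending_size) tuples.
    """
    aborts = Counter()
    writes = []
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            aborts[m.group(3)] += 1
            continue
        m = DRBD_RE.search(line)
        if m:
            writes.append(tuple(int(g) for g in m.groups()))
    return aborts, writes


if __name__ == "__main__":
    # Two lines copied from the excerpt above, used as sample input.
    sample = [
        "Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued "
        "on tid:1 lun:6 by sid:844424967684608 (Function Complete)",
        "Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] "
        "Concurrent local write detected! [DISCARD L] "
        "new: 2077806177s +3584; pending: 2077806177s +3584",
    ]
    aborts, writes = summarize(sample)
    for sid, count in aborts.most_common():
        print(f"sid {sid}: {count} abort(s)")
    for new_s, new_len, pend_s, pend_len in writes:
        same = (new_s, new_len) == (pend_s, pend_len)
        print(f"sector {new_s} +{new_len}: identical to pending write = {same}")
```

Any counts it produces are lower bounds, since syslog collapses repeats into "last message repeated N times" lines and this sketch does not expand them.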
