Are you using the IPaddr2 primitive? Could you post your configuration?

Nick.
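P.S. While you gather the configuration, it may help to see which initiator sessions are generating the aborts. Below is a rough Python sketch, not a tested tool: it tallies the "Abort Task" and "Concurrent local write detected" messages from syslog lines shaped like the ones quoted below. The function name `summarize` and the way "last message repeated N times" is folded into the preceding message are my own assumptions, and it expects one log entry per line, as in the raw log file rather than in the wrapped mail.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread.
ABORT_RE = re.compile(
    r"iscsi_trgt: Abort Task \(\d+\) issued on tid:(\d+) lun:(\d+) by sid:(\d+)"
)
CONCURRENT_RE = re.compile(r"block (drbd\d+): .*Concurrent local write detected!")
REPEAT_RE = re.compile(r"last message repeated (\d+) times")


def summarize(lines):
    """Count abort and concurrent-write messages in an iterable of log lines.

    Returns (aborts, concurrent):
      aborts     -- Counter keyed by (tid, lun, sid)
      concurrent -- Counter keyed by DRBD device name
    """
    aborts = Counter()
    concurrent = Counter()
    previous = None  # (counter, key) of the last message we counted, if any

    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            key = (int(match.group(1)), int(match.group(2)), match.group(3))
            aborts[key] += 1
            previous = (aborts, key)
            continue

        match = CONCURRENT_RE.search(line)
        if match:
            concurrent[match.group(1)] += 1
            previous = (concurrent, match.group(1))
            continue

        match = REPEAT_RE.search(line)
        if match and previous is not None:
            # Assumption: syslog's repeat notice refers to the line before it.
            counter, key = previous
            counter[key] += int(match.group(1))
            continue

        # Any other message breaks the "repeat" chain.
        previous = None

    return aborts, concurrent
```

To try it, something like `aborts, concurrent = summarize(open("/var/log/messages"))` followed by `aborts.most_common()` should do; the path is a guess at the usual CentOS location, so adjust it to wherever your syslog actually writes.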
On Sun, Oct 30, 2011 at 4:35 PM, James Smith <[email protected]> wrote:
> Ipv4.
>
> Regards,
>
> James
>
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Nick Khamis
> Sent: 30 October 2011 12:28
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>
> Are you using IPV4 or 6?
>
> Nick.
>
> On Sun, Oct 30, 2011 at 4:29 AM, James Smith <[email protected]> wrote:
>> Well fileio hasn't solved the underlying issue, the SAN broke this morning
>> at 6AM:
>>
>> Oct 30 06:01:19 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>> Oct 30 06:01:20 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:01:47 iscsi1cl6 last message repeated 27 times
>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>> Oct 30 06:01:48 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:02:19 iscsi1cl6 last message repeated 30 times
>> Oct 30 06:02:48 iscsi1cl6 last message repeated 28 times
>> Oct 30 06:02:49 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>> Oct 30 06:02:49 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:02:51 iscsi1cl6 last message repeated 2 times
>> Oct 30 06:02:52 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>> Oct 30 06:02:52 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:03:17 iscsi1cl6 last message repeated 24 times
>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_rx_start(1849) 1 3b000030 -7
>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_skip_pdu(459) 3b000030 1 2a 4096
>> Oct 30 06:03:18 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:03:49 iscsi1cl6 last message repeated 30 times
>> Oct 30 06:04:32 iscsi1cl6 last message repeated 42 times
>> Oct 30 06:04:33 iscsi1cl6 cib: [3769]: info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min
>> Oct 30 06:04:33 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:05:04 iscsi1cl6 last message repeated 30 times
>> Oct 30 06:05:41 iscsi1cl6 last message repeated 36 times
>> Oct 30 06:05:42 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5629499605320192 (Function Complete)
>>
>>
>> Regards,
>>
>> James
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of James Smith
>> Sent: 30 October 2011 00:25
>> To: General Linux-HA mailing list
>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>
>> Hi,
>>
>> Changed nothing to my knowledge :p
>>
>> These boxes don't currently have fencing enabled. I imagine the reboot is
>> caused by a kernel panic, sysctl is set to reboot on this.
>>
>> There is one big 4TB LUN, used by several VMs on XenServer, each with
>> multiple disks.
>>
>> In my quest to resolve, I have changed iet to use fileio instead of blockio
>> and fiddled with some drbd performance related bits
>> (http://www.drbd.org/users-guide/s-latency-tuning.html).
>>
>> If I'm woken up again tonight with this thing breaking it's going in the
>> bin. I'll probably also ditch ietd and try open-iscsi or iscsi-scst.
>> Monday morning I'll be shifting some load off this cluster also.
>>
>> Regards,
>>
>> James
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Andreas Kurz
>> Sent: 29 October 2011 22:36
>> To: [email protected]
>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>
>> Hello,
>>
>> On 10/29/2011 08:47 PM, James Smith wrote:
>>> Hi,
>>>
>>> All of a sudden, a SAN pair which was running without any problems for six
>>> months, now decides to fall over every couple of hours.
>>
>> So what did you change? ;-)
>>
>>>
>>> The logs I have to go on are below:
>>>
>>> Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
>>> Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>> Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete)
>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete)
>>> Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete)
>>> Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>> Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
>>> Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>> Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512
>>
>> Concurrent local writes .... Is there any kind of cluster software using a
>> shared quorum disk or sthg. like that using this lun? Or this lun shared
>> between several VMWare ESX VMs?
>>
>>> Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>
>>> After this event, both members of the SAN pair reboot. It is very
>>> disruptive, as it's killing the VMs using this SAN, requiring fsck's after
>>> failure. The load on the SAN doesn't need to be very high for this happen.
>>>
>>
>> They reboot because of a kernel panic, or because of some fencing mechanism?
>>
>>> Running the following:
>>>
>>> CentOS 5 with kernel 2.6.18-274.7.1.el5
>>> IET 1.4.20.2
>>> Pacemaker 1.0.11-1.2.el5
>>> DRBD 8.3.11
>>
>> Would be interesting to see Pacemaker/DRBD/IET config ....
>>
>> Regards,
>> Andreas
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
