Re: [Linux-HA] SANs falling over, don't know why!

Nick Khamis Sun, 30 Oct 2011 13:44:41 -0700

Are you using the IPaddr2 primitive? Could you post your configuration?

Nick.
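
For context on why the address resource matters here: the lrmd lines in the logs quoted below ("Converted dotted-quad netmask to CIDR as: 24") come from the ClusterIP monitor's stderr, which suggests the resource's netmask was given in dotted-quad form (for example 255.255.255.0) rather than as a prefix length. That is an inference from the log text, not something confirmed in this thread. A minimal Python sketch of the same conversion, using only the standard library; the function name netmask_to_prefix is made up for illustration:

```python
import ipaddress


def netmask_to_prefix(netmask: str) -> int:
    """Return the CIDR prefix length for a dotted-quad IPv4 netmask.

    Hypothetical helper for illustration only; it is not part of any
    resource agent. Raises ValueError if the ipaddress module cannot
    interpret the mask.
    """
    # Pair the mask with a dummy network address so ipaddress can parse it.
    return ipaddress.ip_network(f"0.0.0.0/{netmask}").prefixlen


if __name__ == "__main__":
    print(netmask_to_prefix("255.255.255.0"))  # 24
```

If that reading is right, giving the netmask as 24 in the resource definition would probably quiet the message. It looks like log noise rather than the cause of the reboots.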


On Sun, Oct 30, 2011 at 4:35 PM, James Smith <[email protected]> wrote:
> IPv4.
>
> Regards,
>
> James
>
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Nick Khamis
> Sent: 30 October 2011 12:28
> To: General Linux-HA mailing list
> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>
> Are you using IPv4 or IPv6?
>
> Nick.
>
> On Sun, Oct 30, 2011 at 4:29 AM, James Smith <[email protected]> wrote:
>> Well fileio hasn't solved the underlying issue, the SAN broke this morning 
>> at 6AM:
>>
>> Oct 30 06:01:19 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>> Oct 30 06:01:20 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:01:47 iscsi1cl6 last message repeated 27 times
>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>> Oct 30 06:01:48 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>> Oct 30 06:01:48 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:02:19 iscsi1cl6 last message repeated 30 times
>> Oct 30 06:02:48 iscsi1cl6 last message repeated 28 times
>> Oct 30 06:02:49 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5066549651898880 (Function Complete)
>> Oct 30 06:02:49 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:02:51 iscsi1cl6 last message repeated 2 times
>> Oct 30 06:02:52 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete)
>> Oct 30 06:02:52 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:03:17 iscsi1cl6 last message repeated 24 times
>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_rx_start(1849) 1 3b000030 -7
>> Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_skip_pdu(459) 3b000030 1 2a 4096
>> Oct 30 06:03:18 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:03:49 iscsi1cl6 last message repeated 30 times
>> Oct 30 06:04:32 iscsi1cl6 last message repeated 42 times
>> Oct 30 06:04:33 iscsi1cl6 cib: [3769]: info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min
>> Oct 30 06:04:33 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>> Oct 30 06:05:04 iscsi1cl6 last message repeated 30 times
>> Oct 30 06:05:41 iscsi1cl6 last message repeated 36 times
>> Oct 30 06:05:42 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5629499605320192 (Function Complete)
>>
>>
>> Regards,
>>
>> James
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of James Smith
>> Sent: 30 October 2011 00:25
>> To: General Linux-HA mailing list
>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>
>> Hi,
>>
>> Changed nothing to my knowledge :p
>>
>> These boxes don't currently have fencing enabled.  I imagine the reboot is 
>> caused by a kernel panic, sysctl is set to reboot on this.
>>
>> There is one big 4TB LUN, used by several VMs on XenServer, each with 
>> multiple disks.
>>
>> In my attempts to resolve this, I have changed IET to use fileio instead of 
>> blockio and adjusted some DRBD latency-tuning settings 
>> (http://www.drbd.org/users-guide/s-latency-tuning.html).
>>
>> If I'm woken up again tonight with this thing breaking it's going in the 
>> bin.  I'll probably also ditch ietd and try open-iscsi or iscsi-scst.  
>> Monday morning I'll be shifting some load off this cluster also.
>>
>> Regards,
>>
>> James
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Andreas Kurz
>> Sent: 29 October 2011 22:36
>> To: [email protected]
>> Subject: Re: [Linux-HA] SANs falling over, don't know why!
>>
>> Hello,
>>
>> On 10/29/2011 08:47 PM, James Smith wrote:
>>> Hi,
>>>
>>> All of a sudden, a SAN pair which had been running without any problems 
>>> for six months has started falling over every couple of hours.
>>
>> So what did you change? ;-)
>>
>>>
>>> The logs I have to go on are below:
>>>
>>> Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
>>> Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>> Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete)
>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>> Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete)
>>> Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete)
>>> Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>> Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete)
>>> Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>> Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
>>> Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete)
>>> Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
>>> Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512
>>
>> Concurrent local writes ... Is there any kind of cluster software using a 
>> shared quorum disk or something like that on this LUN? Or is this LUN 
>> shared between several VMware ESX VMs?
>>
>>> Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
>>>
>>> After this event, both members of the SAN pair reboot.  It is very 
>>> disruptive, as it kills the VMs using this SAN and they require fscks 
>>> afterwards.  The load on the SAN doesn't need to be very high for this 
>>> to happen.
>>>
>>
>> They reboot because of a kernel panic, or because of some fencing mechanism?
>>
>>> Running the following:
>>>
>>> CentOS 5 with kernel 2.6.18-274.7.1.el5
>>> IET 1.4.20.2
>>> Pacemaker 1.0.11-1.2.el5
>>> DRBD 8.3.11
>>
>> Would be interesting to see Pacemaker/DRBD/IET config ....
>>
>> Regards,
>> Andreas
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
