Hi Sunil,

I looked into the timings of our switches and firewalls and have found a 
suitable value for the idle timeout.  Early testing seems to indicate that this 
does indeed fix our problem.

I want to thank you and the others who responded for your help with this.

Regards,

Mick.

-----Original Message-----
From: Sunil Mushran [mailto:[EMAIL PROTECTED]
Sent: 22 April 2008 18:20
To: Mick Waters
Cc: [email protected]
Subject: Re: [Ocfs2-users] Node reboot during network outage

The issue is not the time the switch takes to reboot. The issue is the amount 
of time the secondary switch takes to find a unique path.

http://en.wikipedia.org/wiki/Spanning_tree_protocol
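
As a rough illustration of why path re-discovery can outlast a 30 second idle timeout, the sketch below works through the classic spanning tree timers. The values are the usual IEEE 802.1D defaults and are an assumption here: nothing in this thread says which timers or which spanning tree variant the switches actually run. The function name is hypothetical.

```python
from typing import Tuple

# Classic IEEE 802.1D default timers, in seconds. ASSUMPTION: the switches
# in this thread may use different values or a faster variant such as RSTP.
MAX_AGE_S = 20
FORWARD_DELAY_S = 15

# Taken from the log line "has been idle for 30.0 seconds".
O2NET_IDLE_TIMEOUT_S = 30


def stp_convergence_window(max_age: int, forward_delay: int) -> Tuple[int, int]:
    """Return (best, worst) seconds before a port forwards traffic again.

    A port spends forward_delay in the listening state and again in the
    learning state. In the worst case the bridge first waits max_age for
    stale topology information to expire.
    """
    best = 2 * forward_delay
    worst = max_age + 2 * forward_delay
    return best, worst


if __name__ == "__main__":
    best, worst = stp_convergence_window(MAX_AGE_S, FORWARD_DELAY_S)
    print(f"convergence window: {best} to {worst} s")
    print(f"idle timeout can expire first: {O2NET_IDLE_TIMEOUT_S <= worst}")
```

Under these assumed defaults the window is 30 to 50 seconds, so a 30 second idle timeout leaves no headroom even before the switch's own reboot time is counted.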

Mick Waters wrote:
> Thanks Sunil,
>
> The network switch is brand new but has a fairly complex configuration due to 
> us running a number of VLANs - however, we have found that it has always 
> taken quite a while to reboot.
>
> I'll try increasing the idle timeout as suggested and let you know what 
> happens.  However, surely this is only treating the symptoms of what is, 
> after all, a contrived scenario.  Rebooting the switch is supposed to test 
> what would happen if we had a real network outage.  What if the switch were 
> to stay down?
>
> My issue is that we have an alternative route via the other NIC in the bond 
> and the other switch.  The affected nodes in the cluster shouldn't fence because 
> they should still be able to see all of the other nodes in the cluster via 
> this other route.
>
> Does this make sense?
>
> Regards,
>
> Mick.
>
> -----Original Message-----
> From: Sunil Mushran [mailto:[EMAIL PROTECTED]
> Sent: 22 April 2008 17:40
> To: Mick Waters
> Cc: [email protected]
> Subject: Re: [Ocfs2-users] Node reboot during network outage
>
> The interface died at 14:25:44 and recovered at 14:27:43.
> That's two minutes.
>
> One solution is to increase o2cb_idle_timeout to more than 2 minutes.
>
> A better solution would be to look into your router setup to determine why it 
> is taking 2 minutes for the router to reconfigure.
>
> Mick Waters wrote:
>
>> Hi, my company is in the process of moving our web and database
>> servers to new hardware.  We have a HP EVA 4100 SAN which is being
>> used by two database servers running in an Oracle 10g cluster and
>> that works fine.  We have gone to extreme lengths to ensure high
>> availability.  The SAN has twin disk arrays, twin controllers, and
>> all servers have dual fibre interfaces.  Networking is (should be)
>> similarly redundant with bonded NICs connected in a two-switch
>> configuration, two firewalls and so on.
>>
>> We also want to share regular Linux filesystems between our servers -
>> HP DL580 G5s running RedHat AS 5 (kernel 2.6.18-53.1.14.el5) and we
>> chose OCFS2 (1.2.8) to manage the cluster.
>>
>> As stated, each server in the 4 node cluster has a bonded interface
>> set up as bond0 in a two-switch configuration (each NIC in the bond
>> is connected to a different switch).  Because this is a two-switch
>> configuration, we are running the bond in active-standby mode and
>> this works just fine.
>>
>> Our problem occurred when we were doing failover testing where we
>> simulated the loss of one of the network switches by powering it off.
>> The result was that the servers rebooted and this makes a mockery of
>> our attempts at a HA solution.
>>
>> Here is a short section from /var/log/messages following a reboot of
>> one of the switches to simulate an outage:
>>
>> --------------------------------------------------------------------------
>> Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now down
>> Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2 (num 1) at 10.1.3.50:7777 has been idle for 30.0 seconds, shutting it down.
>> Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv 1208870743.673433:1208870743.673434 func (97690d75:2) 1208870697.670758:1208870697.670760)
>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node mtkdb01p2 (num 1) at 10.1.3.50:7777
>> Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
>> Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now up
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418 ERROR: link to 1 went down!
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 ERROR: status = -107
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_broadcast_vote:731 ERROR: status = -107
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_do_request_vote:804 ERROR: status = -107
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_unlink:843 ERROR: status = -107
>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: connection to node mtkdb02p2 (num 2) at 10.1.3.51:7777 has been idle for 30.0 seconds, shutting it down.
>> Apr 22 14:29:29 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870939.955991 now 1208870969.956343 dr 1208870939.955984 adv 1208870939.955992:1208870939.955993 func (97690d75:2) 1208870697.670916:1208870697.670918)
>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: no longer connected to node mtkdb02p2 (num 2) at 10.1.3.51:7777
>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_broadcast_vote:731 ERROR: status = -107
>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_do_request_vote:804 ERROR: status = -107
>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_unlink:843 ERROR: status = -107
>> Apr 22 14:34:23 mtkws01p1 syslogd 1.4.1: restart.
>> --------------------------------------------------------------------------
>>
>> Things that I have tried...
>>
>> I've tried setting up the bond with both miimon and ARP monitoring
>> (at different times of course) because when the switch comes back up
>> the link detect goes up and down several times while the switch
>> initialises and I hoped that ARP monitoring might be more reliable -
>> it made no difference at all.
>>
>> I've increased the heartbeat timeout to 61 from 31 but, as yet, I
>> haven't played with any of the other cluster configuration variables.
>>
>> Has anyone with a similar configuration experienced problems like
>> this and found a solution?
>>
>> Regards,
>>
>> Mick.
>>
>> ------------------------------------------------------------------------
>>
>> Mick Waters
>> Senior Systems Developer
>>
>> w: +44 (0)208 335 2011
>> m: +44 (0)7849 887 277
>> e: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>>
>> www.motortrak.com <http://www.motortrak.com/> digital media solutions
>>
>> Motortrak Ltd, AC Court, High St, Thames Ditton, Surrey, KT7 0SR,
>> United Kingdom
>>
>> The information contained in this message is for the intended
>> addressee only and may contain confidential and/or privileged
>> information. If you are not the intended addressee, please delete
>> this message and notify the sender; do not copy or distribute this
>> message or disclose its contents to anyone. Any views or opinions
>> expressed in this message are those of the author and do not
>> necessarily represent those of Motortrak Limited or of any of its
>> associated companies. No reliance may be placed on this message
>> without written confirmation from an authorised representative of the 
>> company.
>>
>> Registered in England 3098391 V.A.T. Registered No. 667463890
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> [email protected]
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>
>
>
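As a rough check on the figures quoted in this thread, the sketch below recomputes the outage from the two bonding log entries and derives a candidate idle timeout that would outlast it. The helper names and the 10 second margin are hypothetical, chosen only for illustration. Expressing the result in milliseconds is also an assumption about how the o2cb configuration stores this value; the thread itself only refers to o2cb_idle_timeout.

```python
from datetime import datetime

# Timestamps copied from the /var/log/messages excerpt quoted in this thread.
LINK_DOWN = "Apr 22 14:25:44"   # bonding: backup interface eth0 is now down
LINK_UP = "Apr 22 14:27:43"     # bonding: backup interface eth0 is now up
IDLE_TIMEOUT_S = 30.0           # "has been idle for 30.0 seconds"


def parse_syslog_time(stamp: str, year: int = 2008) -> datetime:
    """Parse a syslog-style 'Mon DD HH:MM:SS' stamp; syslog omits the year."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def outage_seconds(down: str, up: str) -> float:
    """Seconds between the link-down and link-up log entries."""
    return (parse_syslog_time(up) - parse_syslog_time(down)).total_seconds()


def minimum_idle_timeout_ms(outage_s: float, margin_s: float = 10.0) -> int:
    """Idle timeout in ms that outlasts the outage plus a safety margin.

    The default 10 second margin is an arbitrary illustration, not a
    recommendation.
    """
    return int((outage_s + margin_s) * 1000)


if __name__ == "__main__":
    outage = outage_seconds(LINK_DOWN, LINK_UP)
    print(f"observed outage: {outage:.0f} s")
    print(f"timeout expired during outage: {IDLE_TIMEOUT_S < outage}")
    print(f"candidate idle timeout: {minimum_idle_timeout_ms(outage)} ms")
```

This only bounds the timeout from below for the one outage in the log. As noted in the thread, the time the remaining switch needs to settle on a new path should be measured as well before picking a value.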


