STP shouldn't have anything to do with the nodes still seeing each other
when the switch fails.

Have you checked that your bonding config is correct?
For example, run

cat /proc/net/bonding/bond0

and check that the bond fails over to the other eth interface when the
switch goes down.
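
If it helps, here is a rough sketch of what to pull out of that file. It
assumes the usual field names the bonding driver prints (Currently Active
Slave, Slave Interface, MII Status). bond_summary is just a name I made up,
and the optional path argument is only there so it can be run against a
saved copy of the file.

```shell
# Print the active slave and the MII status of each slave in a bond.
# $1: path to the bonding status file (assumed default: bond0).
bond_summary() {
    file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        /^Currently Active Slave:/ { print "active=" $2 }
        /^Slave Interface:/        { slave = $2 }
        # The first "MII Status" line belongs to the bond itself; skip it.
        /^MII Status:/ && slave != "" { print slave "=" $2; slave = "" }
    ' "$file"
}
```

Run it before and after powering off the switch. In active-backup mode the
active= line should move to the slave that is still cabled to the
surviving switch.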

Cheers
Brendan


Sunil Mushran wrote:
> The issue is not the time the switch takes to reboot. The issue is the
> amount of time the secondary switch takes to find a unique path.
>
> http://en.wikipedia.org/wiki/Spanning_tree_protocol
>
> Mick Waters wrote:
>   
>> Thanks Sunil,
>>
>> The network switch is brand new but has a fairly complex configuration due 
>> to us running a number of VLANs - however, we have found that it has always 
>> taken quite a while to reboot.
>>
>> I'll try increasing the idle timeout as suggested and let you know what 
>> happens.  However, surely this is only treating the symptoms of what is, 
>> after all, a contrived scenario.  Rebooting the switch is supposed to test 
>> what would happen if we had a real network outage.  What if the switch were 
>> to stay down?
>>
>> My issue is that we have an alternative route via the other NIC in the bond 
>> and the other switch.  The affected nodes in the cluster shouldn't fence because 
>> they should still be able to see all of the other nodes in the cluster via 
>> this other route.
>>
>> Does this make sense?
>>
>> Regards,
>>
>> Mick.
>>
>> -----Original Message-----
>> From: Sunil Mushran [mailto:[EMAIL PROTECTED]
>> Sent: 22 April 2008 17:40
>> To: Mick Waters
>> Cc: [email protected]
>> Subject: Re: [Ocfs2-users] Node reboot during network outage
>>
>> The interface died at 14:25:44 and recovered at 14:27:43.
>> That's two minutes.
>>
>> One solution is to increase o2cb_idle_timeout to more than 2 minutes.
>>
>> Better solution would be to look into your router setup to determine why it 
>> is taking 2 minutes for the router to reconfigure.
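
The two-minute figure can be checked against the log timestamps. This is
only a sketch: to_seconds is a made-up helper, and the 30 second idle
timeout is taken from the "has been idle for 30.0 seconds" message in the
log. I believe the setting is O2CB_IDLE_TIMEOUT_MS in /etc/sysconfig/o2cb,
but that is an assumption, so check it against your OCFS2 version.

```shell
# Convert HH:MM:SS to seconds since midnight.
# awk is used so that leading zeros are not read as octal.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Timestamps from the log: link down, then the bond reports eth0 up again.
down=$(to_seconds 14:25:44)
up=$(to_seconds 14:27:43)
outage=$((up - down))

idle_timeout=30   # seconds, as reported by o2net in the log

echo "outage: ${outage}s, idle timeout: ${idle_timeout}s"
if [ "$outage" -gt "$idle_timeout" ]; then
    echo "idle timeout must exceed ${outage}s ($((outage * 1000)) ms)"
fi
```

The outage is several times longer than the idle timeout, which is why
o2net dropped the connection well before the link came back.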
>>
>> Mick Waters wrote:
>>   
>>     
>>> Hi, my company is in the process of moving our web and database
>>> servers to new hardware.  We have a HP EVA 4100 SAN which is being
>>> used by two database servers running in an Oracle 10g cluster and that
>>> works fine.  We have gone to extreme lengths to ensure high
>>> availability.  The SAN has twin disk arrays, twin controllers, and all
>>> servers have dual fibre interfaces.  Networking is (should be)
>>> similarly redundant with bonded NICs connected in a two-switch
>>> configuration, two firewalls and so on.
>>>
>>> We also want to share regular Linux filesystems between our servers -
>>> HP DL580 G5s running RedHat AS 5 (kernel 2.6.18-53.1.14.el5) and we
>>> chose OCFS2 (1.2.8) to manage the cluster.
>>>
>>> As stated, each server in the 4 node cluster has a bonded interface
>>> set up as bond0 in a two-switch configuration (each NIC in the bond is
>>> connected to a different switch).  Because this is a two-switch
>>> configuration, we are running the bond in active-standby mode and this
>>> works just fine.
>>>
>>> Our problem occurred when we were doing failover testing where we
>>> simulated the loss of one of the network switches by powering it off.
>>> The result was that the servers rebooted and this makes a mockery of
>>> our attempts at a HA solution.
>>>
>>> Here is a short section from /var/log/messages following a reboot of
>>> one of the switches to simulate an outage:
>>>
>>> --------------------------------------------------------------------------
>>> Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now down
>>> Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
>>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2 (num 1) at 10.1.3.50:7777 has been idle for 30.0 seconds, shutting it down.
>>> Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv 1208870743.673433:1208870743.673434 func (97690d75:2) 1208870697.670758:1208870697.670760)
>>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node mtkdb01p2 (num 1) at 10.1.3.50:7777
>>> Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
>>> Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now up
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418 ERROR: link to 1 went down!
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 ERROR: status = -107
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_broadcast_vote:731 ERROR: status = -107
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_do_request_vote:804 ERROR: status = -107
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_unlink:843 ERROR: status = -107
>>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: connection to node mtkdb02p2 (num 2) at 10.1.3.51:7777 has been idle for 30.0 seconds, shutting it down.
>>> Apr 22 14:29:29 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870939.955991 now 1208870969.956343 dr 1208870939.955984 adv 1208870939.955992:1208870939.955993 func (97690d75:2) 1208870697.670916:1208870697.670918)
>>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: no longer connected to node mtkdb02p2 (num 2) at 10.1.3.51:7777
>>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_broadcast_vote:731 ERROR: status = -107
>>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_do_request_vote:804 ERROR: status = -107
>>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_unlink:843 ERROR: status = -107
>>> Apr 22 14:34:23 mtkws01p1 syslogd 1.4.1: restart.
>>> --------------------------------------------------------------------------
>>>
>>> Things that I have tried...
>>>
>>> I've tried setting up the bond with both miimon and ARP monitoring (at
>>> different times of course) because when the switch comes back up the
>>> link detect goes up and down several times while the switch
>>> initialises and I hoped that ARP monitoring might be more reliable -
>>> it made no difference at all.
>>>
>>> I've increased the heartbeat timeout to 61 from 31 but, as yet, I
>>> haven't played with any of the other cluster configuration variables.
>>>
>>> Has anyone with a similar configuration experienced problems like this
>>> and found a solution?
>>>
>>> Regards,
>>>
>>> Mick.
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Mick Waters
>>> Senior Systems Developer
>>>
>>> w: +44 (0)208 335 2011
>>> m: +44 (0)7849 887 277
>>> e: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>>>
>>> www.motortrak.com <http://www.motortrak.com/> digital media solutions
>>>
>>> Motortrak Ltd, AC Court, High St, Thames Ditton, Surrey, KT7 0SR,
>>> United Kingdom
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> [email protected]
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
