STP shouldn't have anything to do with the nodes still seeing each other when the switch fails.
Have you checked that your bonding config is correct? I.e. cat /proc/net/bonding/bond0 and check that it fails over to the other eth when the switch goes down.

Cheers
Brendan

Sunil Mushran wrote:
> The issue is not the time the switch takes to reboot. The issue is the
> amount of time the secondary switch takes to find a unique path.
>
> http://en.wikipedia.org/wiki/Spanning_tree_protocol
>
> Mick Waters wrote:
>
>> Thanks Sunil,
>>
>> The network switch is brand new but has a fairly complex configuration due
>> to us running a number of VLANs - however, we have found that it has always
>> taken quite a while to reboot.
>>
>> I'll try increasing the idle timeout as suggested and let you know what
>> happens. However, surely this is only treating the symptoms of what is,
>> after all, a contrived scenario. Rebooting the switch is supposed to test
>> what would happen if we had a real network outage. What if the switch were
>> to stay down?
>>
>> My issue is that we have an alternative route via the other NIC in the bond
>> and the other switch. The affected nodes in cluster shouldn't fence because
>> they should still be able to see all of the other nodes in the cluster via
>> this other route.
>>
>> Does this make sense?
>>
>> Regards,
>>
>> Mick.
>>
>> -----Original Message-----
>> From: Sunil Mushran [mailto:[EMAIL PROTECTED]
>> Sent: 22 April 2008 17:40
>> To: Mick Waters
>> Cc: [email protected]
>> Subject: Re: [Ocfs2-users] Node reboot during network outage
>>
>> The interface died at 14:25:44 and recovered at 14:27:43.
>> That's two minutes.
>>
>> One solution is to increase o2cb_idle_timeout to > 2mins.
>>
>> Better solution would be to look into your router setup to determine why it
>> is taking 2 minutes for the router to reconfigure.
>>
>> Mick Waters wrote:
>>
>>
>>> Hi, my company is in the process of moving our web and database
>>> servers to new hardware.
>>> We have a HP EVA 4100 SAN which is being
>>> used by two database servers running in an Oracle 10g cluster and that
>>> works fine. We have gone to extreme lengths to ensure high
>>> availability. The SAN has twin disk arrays, twin controllers, and all
>>> servers have dual fibre interfaces. Networking is (should be)
>>> similarly redundant with bonded NICs connected in two-switch
>>> configuration, two firewalls and so on.
>>>
>>> We also want to share regular Linux filesystems between our servers -
>>> HP DL580 G5s running RedHat AS 5 (kernel 2.6.18-53.1.14.el5) and we
>>> chose OCFS2 (1.2.8) to manage the cluster.
>>>
>>> As stated, each server in the 4 node cluster has a bonded interface
>>> set up as bond0 in a two-switch configuration (each NIC in the bond is
>>> connected to a different switch). Because this is a two-switch
>>> configuration, we are running the bond in active-standby mode and this
>>> works just fine.
>>>
>>> Our problem occurred when we were doing failover testing where we
>>> simulated the loss of one of the network switches by powering it off.
>>> The result was that the servers rebooted and this makes a mockery of
>>> our attempts at a HA solution.
>>>
>>> Here is a short section from /var/log/messages following a reboot of
>>> one of the switches to simulate an outage:
>>>
>>> ----------------------------------------------------------------------
>>> ----
>>> Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now down
>>> Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
>>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2 (num 1) at 10.1.3.50:7777 has been idle for 30.0 seconds, shutting it down.
>>> Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv 1208870743.673433:1208870743.673434 func (97690d75:2) 1208870697.670758:1208870697.670760)
>>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node mtkdb01p2 (num 1) at 10.1.3.50:7777
>>> Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
>>> Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now up
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418 ERROR: link to 1 went down!
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 ERROR: status = -107
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_broadcast_vote:731 ERROR: status = -107
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_do_request_vote:804 ERROR: status = -107
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_unlink:843 ERROR: status = -107
>>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: connection to node mtkdb02p2 (num 2) at 10.1.3.51:7777 has been idle for 30.0 seconds, shutting it down.
>>> Apr 22 14:29:29 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870939.955991 now 1208870969.956343 dr 1208870939.955984 adv 1208870939.955992:1208870939.955993 func (97690d75:2) 1208870697.670916:1208870697.670918)
>>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: no longer connected to node mtkdb02p2 (num 2) at 10.1.3.51:7777
>>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_broadcast_vote:731 ERROR: status = -107
>>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_do_request_vote:804 ERROR: status = -107
>>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_unlink:843 ERROR: status = -107
>>> Apr 22 14:34:23 mtkws01p1 syslogd 1.4.1: restart.
>>> ----------------------------------------------------------------------
>>> ----
>>>
>>> Things that I have tried...
>>>
>>> I've tried setting up the bond with both miimon and ARP monitoring (at
>>> different times of course) because when the switch comes back up the
>>> link detect goes up and down several times while the switch
>>> initialises and I hoped that ARP monitoring might be more reliable -
>>> it made no difference at all.
>>>
>>> I've increased the heartbeat timeout to 61 from 31 but, as yet, I
>>> haven't played with any of the other cluster configuration variables.
>>>
>>> Has anyone with a similar configuration experienced problems like this
>>> and found a solution?
>>>
>>> Regards,
>>>
>>> Mick.
>>>
>>> ----------------------------------------------------------------------
>>> --
>>>
>>> Mick Waters
>>> Senior Systems Developer
>>>
>>> w: +44 (0)208 335 2011
>>> m: +44 (0)7849 887 277
>>> e: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>>>
>>> www.motortrak.com <http://www.motortrak.com/> digital media solutions
>>>
>>> Motortrak Ltd, AC Court, High St, Thames Ditton, Surrey, KT7 0SR,
>>> United Kingdom
>>>
>>> The information contained in this message is for the intended
>>> addressee only and may contain confidential and/or privileged
>>> information. If you are not the intended addressee, please delete this
>>> message and notify the sender; do not copy or distribute this message
>>> or disclose its contents to anyone. Any views or opinions expressed in
>>> this message are those of the author and do not necessarily represent
>>> those of Motortrak Limited or of any of its associated companies. No
>>> reliance may be placed on this message without written confirmation
>>> from an authorised representative of the company.
>>>
>>> Registered in England 3098391 V.A.T. Registered No.
>>> 667463890
>>>
>>>
>>> ----------------------------------------------------------------------
>>> --
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> [email protected]
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
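
A rough sketch of the bonding check suggested at the top of this message (cat /proc/net/bonding/bond0), in case it helps to script it during failover tests. It only reads the "Currently Active Slave", "Slave Interface", "MII Status" and "Link Failure Count" lines that the bonding driver normally prints. SAMPLE is made-up text for illustration, not output from Mick's servers, and the function names are hypothetical.

```python
from pathlib import Path

# Made-up example of what /proc/net/bonding/bond0 can look like for an
# active-backup bond after one link has dropped (NOT taken from the thread).
SAMPLE = """\
Ethernet Channel Bonding Driver: v3.1.2

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: down
Link Failure Count: 1

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
"""


def parse_bonding_status(text):
    """Pull the mode, active slave and per-slave link state out of the text."""
    status = {"mode": None, "active_slave": None, "slaves": {}}
    current = None  # name of the slave section we are currently inside
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            status["mode"] = value
        elif key == "Currently Active Slave":
            status["active_slave"] = value
        elif key == "Slave Interface":
            current = value
            status["slaves"][current] = {"mii_status": None, "link_failures": None}
        elif current is not None and key == "MII Status":
            status["slaves"][current]["mii_status"] = value
        elif current is not None and key == "Link Failure Count":
            status["slaves"][current]["link_failures"] = int(value)
    return status


def active_slave_is_up(status):
    """True if the bond names an active slave and that slave's link is up."""
    slave = status["slaves"].get(status["active_slave"])
    return slave is not None and slave["mii_status"] == "up"


if __name__ == "__main__":
    proc_file = Path("/proc/net/bonding/bond0")
    # Fall back to the sample text on machines without a bond0.
    text = proc_file.read_text() if proc_file.exists() else SAMPLE
    status = parse_bonding_status(text)
    print("mode:", status["mode"])
    print("active slave:", status["active_slave"])
    for name, info in status["slaves"].items():
        print(name, info["mii_status"], "failures:", info["link_failures"])
    print("active slave link up:", active_slave_is_up(status))
```

Run it once before powering off the switch and again while the switch is down. If the active slave does not move to the NIC on the surviving switch, the problem is in the bond rather than in OCFS2.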

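
To put numbers on the timing Sunil describes, here is a small sketch that works out the intervals from the timestamps in the quoted log. The timestamps are copied from the log; the 30 second safety margin and the helper name are my own assumptions, not a recommendation from the OCFS2 developers. As far as I know the idle timeout is set in milliseconds (O2CB_IDLE_TIMEOUT_MS in /etc/sysconfig/o2cb), but treat that as an assumption and check the documentation for your OCFS2 version.

```python
from datetime import datetime


def at(hms):
    """Parse an HH:MM:SS time of day from the log (all entries are on Apr 22)."""
    return datetime.strptime(hms, "%H:%M:%S")


# Timestamps copied from the quoted /var/log/messages excerpt.
link_down = at("14:25:44")        # bond0: backup interface eth0 is now down
first_idle_drop = at("14:26:13")  # o2net: connection to mtkdb01p2 idle, shut down
link_up = at("14:27:43")          # bond0: backup interface eth0 is now up

outage_seconds = int((link_up - link_down).total_seconds())
seconds_until_first_drop = int((first_idle_drop - link_down).total_seconds())


def suggested_idle_timeout_ms(outage_s, margin_s=30):
    """Hypothetical helper: outage length plus an arbitrary margin, in ms."""
    return (outage_s + margin_s) * 1000


if __name__ == "__main__":
    print("outage length (s):", outage_seconds)
    print("first o2net drop after (s):", seconds_until_first_drop)
    print("idle timeout to ride it out (ms):",
          suggested_idle_timeout_ms(outage_seconds))
```

The link was down for 119 seconds, while o2net gave up on the first node about 29 seconds in, which matches the 30.0 second idle timeout in the log. Any idle timeout shorter than the outage cannot survive this test unless traffic actually flows over the other NIC.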