The issue is not the time the switch takes to reboot. The issue is the amount of time the secondary switch takes to find a unique path.
http://en.wikipedia.org/wiki/Spanning_tree_protocol

Mick Waters wrote:
> Thanks Sunil,
>
> The network switch is brand new but has a fairly complex configuration due to
> us running a number of VLANs - however, we have found that it has always
> taken quite a while to reboot.
>
> I'll try increasing the idle timeout as suggested and let you know what
> happens. However, surely this is only treating the symptoms of what is,
> after all, a contrived scenario. Rebooting the switch is supposed to test
> what would happen if we had a real network outage. What if the switch were
> to stay down?
>
> My issue is that we have an alternative route via the other NIC in the bond
> and the other switch. The affected nodes in the cluster shouldn't fence because
> they should still be able to see all of the other nodes in the cluster via
> this other route.
>
> Does this make sense?
>
> Regards,
>
> Mick.
>
> -----Original Message-----
> From: Sunil Mushran [mailto:[EMAIL PROTECTED]
> Sent: 22 April 2008 17:40
> To: Mick Waters
> Cc: [email protected]
> Subject: Re: [Ocfs2-users] Node reboot during network outage
>
> The interface died at 14:25:44 and recovered at 14:27:43.
> That's two minutes.
>
> One solution is to increase o2cb_idle_timeout to > 2mins.
>
> Better solution would be to look into your router setup to determine why it
> is taking 2 minutes for the router to reconfigure.
>
> Mick Waters wrote:
>
>> Hi, my company is in the process of moving our web and database
>> servers to new hardware. We have a HP EVA 4100 SAN which is being
>> used by two database servers running in an Oracle 10g cluster and that
>> works fine. We have gone to extreme lengths to ensure high
>> availability. The SAN has twin disk arrays, twin controllers, and all
>> servers have dual fibre interfaces. Networking is (should be)
>> similarly redundant with bonded NICs connected in two-switch
>> configuration, two firewalls and so on.
>>
>> We also want to share regular Linux filesystems between our servers -
>> HP DL580 G5s running RedHat AS 5 (kernel 2.6.18-53.1.14.el5) and we
>> chose OCFS2 (1.2.8) to manage the cluster.
>>
>> As stated, each server in the 4 node cluster has a bonded interface
>> set up as bond0 in a two-switch configuration (each NIC in the bond is
>> connected to a different switch). Because this is a two-switch
>> configuration, we are running the bond in active-standby mode and this
>> works just fine.
>>
>> Our problem occurred when we were doing failover testing where we
>> simulated the loss of one of the network switches by powering it off.
>> The result was that the servers rebooted and this made a mockery of
>> our attempts at a HA solution.
>>
>> Here is a short section from /var/log/messages following a reboot of
>> one of the switches to simulate an outage:
>>
>> --------------------------------------------------------------------------
>> Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now down
>> Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2 (num 1) at 10.1.3.50:7777 has been idle for 30.0 seconds, shutting it down.
>> Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv 1208870743.673433:1208870743.673434 func (97690d75:2) 1208870697.670758:1208870697.670760)
>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node mtkdb01p2 (num 1) at 10.1.3.50:7777
>> Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
>> Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now up
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418 ERROR: link to 1 went down!
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 ERROR: status = -107
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_broadcast_vote:731 ERROR: status = -107
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_do_request_vote:804 ERROR: status = -107
>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_unlink:843 ERROR: status = -107
>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: connection to node mtkdb02p2 (num 2) at 10.1.3.51:7777 has been idle for 30.0 seconds, shutting it down.
>> Apr 22 14:29:29 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870939.955991 now 1208870969.956343 dr 1208870939.955984 adv 1208870939.955992:1208870939.955993 func (97690d75:2) 1208870697.670916:1208870697.670918)
>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: no longer connected to node mtkdb02p2 (num 2) at 10.1.3.51:7777
>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_broadcast_vote:731 ERROR: status = -107
>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_do_request_vote:804 ERROR: status = -107
>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_unlink:843 ERROR: status = -107
>> Apr 22 14:34:23 mtkws01p1 syslogd 1.4.1: restart.
>> --------------------------------------------------------------------------
>>
>> Things that I have tried...
>>
>> I've tried setting up the bond with both miimon and ARP monitoring (at
>> different times of course) because when the switch comes back up the
>> link detect goes up and down several times while the switch
>> initialises and I hoped that ARP monitoring might be more reliable -
>> it made no difference at all.
>>
>> I've increased the heartbeat timeout to 61 from 31 but, as yet, I
>> haven't played with any of the other cluster configuration variables.
>>
>> Has anyone with a similar configuration experienced problems like this
>> and found a solution?
>>
>> Regards,
>>
>> Mick.
>>
>> Mick Waters
>> Senior Systems Developer
>> Motortrak Ltd
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> [email protected]
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
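
For anyone following the thread later, here is a minimal sketch of the arithmetic behind the "two minutes" figure and the suggested timeout change. It only uses the two timestamps and the 30.0 second idle limit that appear in the quoted log. The 30 second safety margin, the year default and the function names are assumptions made for illustration; they are not OCFS2 settings or recommendations, and the exact name and location of the idle timeout setting depend on the OCFS2 release in use.

```python
from datetime import datetime

# Timestamps copied from the quoted /var/log/messages excerpt.
LINK_DOWN = "Apr 22 14:25:44"   # bonding: eth0 is now down
LINK_UP = "Apr 22 14:27:43"     # bonding: eth0 is now up

# Idle limit reported by o2net in the same log ("idle for 30.0 seconds").
CURRENT_IDLE_TIMEOUT_S = 30.0


def outage_seconds(down: str, up: str, year: int = 2008) -> float:
    """Return the seconds between two syslog-style timestamps.

    Syslog lines carry no year, so one is supplied (2008 matches the
    dates of the messages in this thread).
    """
    fmt = "%Y %b %d %H:%M:%S"
    t_down = datetime.strptime(f"{year} {down}", fmt)
    t_up = datetime.strptime(f"{year} {up}", fmt)
    return (t_up - t_down).total_seconds()


def suggested_idle_timeout_ms(outage_s: float, margin_s: float = 30.0) -> int:
    """Return an idle timeout in milliseconds that outlasts the outage.

    margin_s is an arbitrary illustrative margin, not a tuned value.
    """
    return int((outage_s + margin_s) * 1000)


outage = outage_seconds(LINK_DOWN, LINK_UP)
print(f"link outage: {outage:.0f} s")
print(f"outage exceeds current idle timeout: {outage > CURRENT_IDLE_TIMEOUT_S}")
print(f"idle timeout that would cover it: {suggested_idle_timeout_ms(outage)} ms")
```

This only shows why a 30 second idle timeout cannot ride out an outage of this length. It does not address the open question in the thread, namely why traffic did not simply continue over the other NIC in the bond.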
