The interface died at 14:25:44 and recovered at 14:27:43. That's two minutes.
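As a quick check of that arithmetic against the 30-second idle limit that o2net reports in the log below, here is a minimal sketch. The function and variable names are made up for illustration, and it assumes both timestamps fall on the same day:

```python
from datetime import datetime

def outage_seconds(down: str, up: str) -> float:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    return (datetime.strptime(up, fmt) - datetime.strptime(down, fmt)).total_seconds()

# Times of the "eth0 is now down" / "eth0 is now up" bonding messages.
gap = outage_seconds("14:25:44", "14:27:43")

# Idle limit reported by o2net ("has been idle for 30.0 seconds").
idle_timeout_s = 30.0

# Smallest timeout, in milliseconds, that would cover the whole gap.
min_timeout_ms = int(gap * 1000)

print(gap, gap > idle_timeout_s, min_timeout_ms)
```

The gap is well over the idle limit, which is why the o2net connections were shut down before the link came back.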
One solution is to increase o2cb_idle_timeout to more than 2 minutes.
A better solution would be to look into your router setup to determine
why it is taking 2 minutes for the router to reconfigure.

Mick Waters wrote:
> Hi, my company is in the process of moving our web and database
> servers to new hardware. We have a HP EVA 4100 SAN which is being
> used by two database servers running in an Oracle 10g cluster and that
> works fine. We have gone to extreme lengths to ensure high
> availability. The SAN has twin disk arrays, twin controllers, and all
> servers have dual fibre interfaces. Networking is (should be)
> similarly redundant with bonded NICs connected in two-switch
> configuration, two firewalls and so on.
>
> We also want to share regular Linux filesystems between our servers -
> HP DL580 G5s running RedHat AS 5 (kernel 2.6.18-53.1.14.el5) and
> we chose OCFS2 (1.2.8) to manage the cluster.
>
> As stated, each server in the 4 node cluster has a bonded interface
> set up as bond0 in a two-switch configuration (each NIC in the bond is
> connected to a different switch). Because this is a two-switch
> configuration, we are running the bond in active-standby mode and this
> works just fine.
>
> Our problem occurred when we were doing failover testing where we
> simulated the loss of one of the network switches by powering it off.
> The result was that the servers rebooted and this makes a mockery of
> our attempts at a HA solution.
>
> Here is a short section from /var/log/messages following a reboot of
> one of the switches to simulate an outage:
>
> --------------------------------------------------------------------------
> Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now down
> Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
> Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2 (num 1) at 10.1.3.50:7777 has been idle for 30.0 seconds, shutting it down.
> Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv 1208870743.673433:1208870743.673434 func (97690d75:2) 1208870697.670758:1208870697.670760)
> Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node mtkdb01p2 (num 1) at 10.1.3.50:7777
> Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
> Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now up
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418 ERROR: link to 1 went down!
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 ERROR: status = -107
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_broadcast_vote:731 ERROR: status = -107
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_do_request_vote:804 ERROR: status = -107
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_unlink:843 ERROR: status = -107
> Apr 22 14:29:29 mtkws01p1 kernel: o2net: connection to node mtkdb02p2 (num 2) at 10.1.3.51:7777 has been idle for 30.0 seconds, shutting it down.
> Apr 22 14:29:29 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870939.955991 now 1208870969.956343 dr 1208870939.955984 adv 1208870939.955992:1208870939.955993 func (97690d75:2) 1208870697.670916:1208870697.670918)
> Apr 22 14:29:29 mtkws01p1 kernel: o2net: no longer connected to node mtkdb02p2 (num 2) at 10.1.3.51:7777
> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_broadcast_vote:731 ERROR: status = -107
> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_do_request_vote:804 ERROR: status = -107
> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_unlink:843 ERROR: status = -107
> Apr 22 14:34:23 mtkws01p1 syslogd 1.4.1: restart.
> --------------------------------------------------------------------------
>
> Things that I have tried...
>
> I've tried setting up the bond with both miimon and ARP monitoring (at
> different times of course) because when the switch comes back up the
> link detect goes up and down several times while the switch
> initialises and I hoped that ARP monitoring might be more reliable -
> it made no difference at all.
>
> I've increased the heartbeat timeout to 61 from 31 but, as yet,
> I haven't played with any of the other cluster configuration variables.
>
> Has anyone with a similar configuration experienced problems like this
> and found a solution?
>
> Regards,
>
> Mick.
>
> ------------------------------------------------------------------------
>
> Mick Waters
> Senior Systems Developer
>
> w: +44 (0)208 335 2011
> m: +44 (0)7849 887 277
> e: [EMAIL PROTECTED]
>
> www.motortrak.com <http://www.motortrak.com/>
> digital media solutions
>
> Motortrak Ltd, AC Court, High St, Thames Ditton, Surrey, KT7 0SR,
> United Kingdom
>
> The information contained in this message is for the intended
> addressee only and may contain confidential and/or privileged
> information. If you are not the intended addressee, please delete this
> message and notify the sender; do not copy or distribute this message
> or disclose its contents to anyone. Any views or opinions expressed in
> this message are those of the author and do not necessarily represent
> those of Motortrak Limited or of any of its associated companies. No
> reliance may be placed on this message without written confirmation
> from an authorised representative of the company.
>
> Registered in England 3098391 V.A.T. Registered No. 667463890
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
