Re: [Ocfs2-users] Node reboot during network outage

Sunil Mushran Tue, 22 Apr 2008 09:45:44 -0700

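The interface went down at 14:25:44 and came back up at 14:27:43.
That's two minutes, well past the 30 seconds that o2net waits (per
your log) before it shuts down an idle connection.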
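
As a quick check of that arithmetic, here is a minimal Python sketch
that takes the two bond0 messages from the quoted log (re-joined onto
single lines) and computes the gap between them. The year is an
assumption, since syslog does not record it, and parse_ts is a helper
name made up for this example.

```python
from datetime import datetime

# The two bonding messages from the quoted /var/log/messages excerpt,
# with the mail client's line wrapping undone.
DOWN = "Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now down"
UP = "Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now up"

def parse_ts(line, year=2008):
    """Parse the leading syslog timestamp; the year is assumed, syslog omits it."""
    stamp = " ".join(line.split()[:3])  # e.g. "Apr 22 14:25:44"
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

outage = parse_ts(UP) - parse_ts(DOWN)
outage_seconds = int(outage.total_seconds())
print(outage_seconds)  # 119
```

So the link was out for 119 seconds, roughly four times the 30.0
second idle period reported in the o2net messages.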


One solution is to increase o2cb_idle_timeout to more than 2 minutes.
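
To put a number on that, here is a small Python sketch. It assumes
the setting is spelled O2CB_IDLE_TIMEOUT_MS in /etc/sysconfig/o2cb and
is given in milliseconds; check the name against your installed o2cb
init script. The 25% safety margin and the function name
idle_timeout_ms are arbitrary choices for illustration, not values
taken from the OCFS2 documentation.

```python
import math

def idle_timeout_ms(outage_seconds, margin=0.25):
    """Return a timeout in ms that exceeds the outage by `margin`,
    rounded up to a whole second."""
    seconds = math.ceil(outage_seconds * (1 + margin))
    return seconds * 1000

# 119 s is the gap between the bond0 "down" and "up" messages in the log.
timeout_ms = idle_timeout_ms(119)
print(f"O2CB_IDLE_TIMEOUT_MS={timeout_ms}")  # assumed variable name
```

Whatever value you settle on, the o2net timeouts should be set to the
same value on every node in the cluster.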

A better solution would be to look into your router setup to determine
why it takes 2 minutes for the router to reconfigure.

Mick Waters wrote:
> Hi, my company is in the process of moving our web and database 
> servers to new hardware.  We have a HP EVA 4100 SAN which is being 
> used by two database servers running in an Oracle 10g cluster and that 
> works fine.  We have gone to extreme lengths to ensure high 
> availability.  The SAN has twin disk arrays, twin controllers, and all 
> servers have dual fibre interfaces.  Networking is (should be) 
> similarly redundant with bonded NICs connected in a two-switch 
> configuration, two firewalls and so on.
>  
> We also want to share regular Linux filesystems between our servers - 
> HP DL580 G5s running RedHat AS 5 (kernel 2.6.18-53.1.14.el5) and 
> we chose OCFS2 (1.2.8) to manage the cluster.
>  
> As stated, each server in the 4 node cluster has a bonded interface 
> set up as bond0 in a two-switch configuration (each NIC in the bond is 
> connected to a different switch).  Because this is a two-switch 
> configuration, we are running the bond in active-standby mode and this 
> works just fine.
>  
> Our problem occurred when we were doing failover testing where we 
> simulated the loss of one of the network switches by powering it off.  
> The result was that the servers rebooted and this makes a mockery of 
> our attempts at a HA solution.
>  
> Here is a short section from /var/log/messages following a reboot of 
> one of the switches to simulate an outage:
>  
> --------------------------------------------------------------------------
> Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface 
> eth0 is now down
> Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
> Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2 
> (num 1) at 10.1.3.50:7777 has been idle for 30.0 seconds, shutting it 
> down.
> Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here 
> are some times that might help debug the situation: (tmr 
> 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv 
> 1208870743.673433:1208870743.673434 func (97690d75:2) 
> 1208870697.670758:1208870697.670760)
> Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node 
> mtkdb01p2 (num 1) at 10.1.3.50:7777
> Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps 
> full duplex
> Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface 
> eth0 is now up
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418 
> ERROR: link to 1 went down!
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 
> ERROR: status = -107
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_broadcast_vote:731 
> ERROR: status = -107
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_do_request_vote:804 
> ERROR: status = -107
> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_unlink:843 ERROR: 
> status = -107
> Apr 22 14:29:29 mtkws01p1 kernel: o2net: connection to node mtkdb02p2 
> (num 2) at 10.1.3.51:7777 has been idle for 30.0 seconds, shutting it 
> down.
> Apr 22 14:29:29 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here 
> are some times that might help debug the situation: (tmr 
> 1208870939.955991 now 1208870969.956343 dr 1208870939.955984 adv 
> 1208870939.955992:1208870939.955993 func (97690d75:2) 
> 1208870697.670916:1208870697.670918)
> Apr 22 14:29:29 mtkws01p1 kernel: o2net: no longer connected to node 
> mtkdb02p2 (num 2) at 10.1.3.51:7777
> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_broadcast_vote:731 
> ERROR: status = -107
> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_do_request_vote:804 
> ERROR: status = -107
> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_unlink:843 ERROR: 
> status = -107
> Apr 22 14:34:23 mtkws01p1 syslogd 1.4.1: restart.
> --------------------------------------------------------------------------
>  
> Things that I have tried...
>  
> I've tried setting up the bond with both miimon and ARP monitoring (at 
> different times of course) because when the switch comes back up the 
> link detect goes up and down several times while the switch 
> initialises and I hoped that ARP monitoring might be more reliable - 
> it made no difference at all.
>  
> I've increased the heartbeat timeout to 61 from 31 but, as yet, 
> I haven't played with any of the other cluster configuration variables.
>  
> Has anyone with a similar configuration experienced problems like this 
> and found a solution?
>  
> Regards,
>  
> Mick.
>  
> ------------------------------------------------------------------------
>
> Mick Waters
> Senior Systems Developer
>
> w: +44 (0)208 335 2011
> m: +44 (0)7849 887 277
> e: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>
> www.motortrak.com <http://www.motortrak.com/>
> digital media solutions
>
> Motortrak Ltd, AC Court, High St, Thames Ditton, Surrey, KT7 0SR, 
> United Kingdom
> ------------------------------------------------------------------------
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users

