Hello,

Apologies for following up on myself.

In ocfs2/cluster/tcp_internal.h:

#define O2NET_KEEPALIVE_DELAY_SECS 5
#define O2NET_IDLE_TIMEOUT_SECS 10

Is this really sensible? Given a small variance in system clocks, the
loss of a single keepalive packet (assuming that o2net_sc_send_keep_req
is the only thing keeping the connection alive) could potentially cause
a node to self-fence and reboot.

Would

#define O2NET_KEEPALIVE_DELAY_SECS 5
#define O2NET_IDLE_TIMEOUT_SECS 20

cause any problems?

Andy

On Thu, 2006-08-03 at 12:41 +0100, Andy Phillips wrote:
> Hello,
>
> I've a two node 10gR2 rac cluster on a pair of sun opteron boxes.
> Redhat AS 4.3 2.6.9-34.0.1.ELsmp x86_64. ocfs 1.2.2. RAC is using
> ASM to talk to the data files, but we have 3 ocfs2 filesystems up
> to share dba files, and the usual bits and bobs.
>
> Things were fine until, on a mostly idle system, this happened out
> of the blue;
>
> Aug 2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
> 172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
> Aug 2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
> times that might help debug the situation: (tmr 1154545576.798263 now
> 1154545586.796978 dr 1154545576.798238 adv
> 1154545576.798291:1154545576.798293 func (06aac8a1:1)
> 1154545566.800782:1154545566.800787)
> Aug 2 19:06:27 fred kernel: o2net: no longer connected to node barney
> (num 0) at 172.16.6.10:7777
> Aug 2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
> fencing this node because it is connected to
> a half-quorum of 1 out of 2 nodes which doesn't include the lowest
> active node 0
> Aug 2 19:08:33 fred kernel: (25,7):o2hb_stop_all_regions:1908 ERROR:
> stopping heartbeat on all active regions.
>
> And the node then halted.
>
> Barney is node 0. The systems were idle. We've hammered the ocfs2
> file systems, and set o2cb_heartbeat_threshold to 61. All is good and
> stable under heavy i/o.
>
> The interconnect is a bonded interface, with two gig cards, each
> connected (with flow control on) to two separate FESX424 switches.
> The switches don't register any problems at this time, nor does linux
> register any interface issues.
>
> I'm looking at the source code at the moment, but nothing is leaping
> out at me. Any ideas? Do the timer debug lines above mean anything to
> anyone?
>
> Thanks
> Andy

--
Andy Phillips, FRAS
Systems Architect, Information Systems.
Betfair.com
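
P.S. To make the margin in the question concrete, here is a rough C sketch. It assumes a deliberately simplified model in which, on an otherwise idle link, one keepalive arrives every O2NET_KEEPALIVE_DELAY_SECS and the idle timer fires once O2NET_IDLE_TIMEOUT_SECS pass with no traffic. That model is an assumption for illustration, not something taken from the o2net source, and the names tolerated_keepalive_losses and PROPOSED_IDLE_TIMEOUT_SECS are made up here.

```c
#include <assert.h>

/* Values quoted from ocfs2/cluster/tcp_internal.h in the message above. */
#define O2NET_KEEPALIVE_DELAY_SECS 5
#define O2NET_IDLE_TIMEOUT_SECS    10

/* Hypothetical alternative raised in the question; not a tested setting. */
#define PROPOSED_IDLE_TIMEOUT_SECS 20

/*
 * Simplified model (an assumption, not the o2net implementation):
 * keepalive k (k = 1, 2, ...) arrives k * delay_secs after the last
 * traffic, and the connection is dropped at timeout_secs unless some
 * keepalive has arrived strictly before that moment.
 *
 * Returns how many consecutive keepalives may be lost while the next
 * one still arrives strictly before the idle timeout.
 */
int tolerated_keepalive_losses(int delay_secs, int timeout_secs)
{
    assert(delay_secs > 0 && timeout_secs > 0);

    /* Largest k such that k * delay_secs < timeout_secs. */
    int last_in_time = (timeout_secs - 1) / delay_secs;

    /* All but one of those in-time keepalives may be lost. */
    return last_in_time > 0 ? last_in_time - 1 : 0;
}
```

Under this model, tolerated_keepalive_losses(5, 10) is 0: if the keepalive at 5 seconds is lost, the next one is due at exactly 10 seconds and races the idle timer, which is the situation described above. With the proposed 20 second timeout the result is 2. Whether the real code behaves this way is exactly what I am asking.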
