Hello,

It's doubly odd then. We'll need to schedule an upgrade to 1.2.3. In the meantime, we've scheduled a cron job that touches a file on each ocfs2 file system every 3 seconds. This should ensure a constant flow of traffic, assuming metadata updates travel across the interconnect.
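For illustration only, here is a minimal sketch of such a periodic touch. Standard cron schedules at one-minute granularity, so a 3-second cadence needs a loop of some kind. The mount points, the marker file name, and the function names below are all hypothetical, and whether an mtime update actually produces o2net traffic is the same assumption as stated above.

```python
import time
from pathlib import Path

# Hypothetical mount points; substitute the real ocfs2 file systems.
MOUNT_POINTS = ["/u01/ocfs2_a", "/u01/ocfs2_b", "/u01/ocfs2_c"]
INTERVAL_SECS = 3  # shorter than the 5 s keepalive delay quoted in this thread


def touch_all(mount_points, name=".o2net_touch"):
    """Create or update one marker file per mount point; return the paths touched."""
    touched = []
    for mount in mount_points:
        marker = Path(mount) / name  # hypothetical marker file name
        try:
            marker.touch(exist_ok=True)  # creates the file or bumps its mtime
            touched.append(marker)
        except OSError as exc:
            # A missing or unwritable mount is reported, not fatal.
            print(f"could not touch {marker}: {exc}")
    return touched


def run(mount_points, interval=INTERVAL_SECS, rounds=None):
    """Touch every `interval` seconds; loop forever when rounds is None."""
    done = 0
    while rounds is None or done < rounds:
        touch_all(mount_points)
        done += 1
        time.sleep(interval)
```

Calling `run(MOUNT_POINTS)` from a long-lived process (or from a once-a-minute cron entry with `rounds=20`) would give roughly the 3-second cadence described.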
I've noticed that there is one other person who seems to have seen this problem:
http://oss.oracle.com/pipermail/ocfs2-users/2006-July/000612.html
but they were on an old version of the kernel and fs code. Any idea as to what the underlying cause may be if it's not a dropped packet?

Would you also mind letting me know what those two one-line changes were, just for my own interest's sake?

Thanks for the quick response.

Andy

On Thu, 2006-08-03 at 09:44 -0700, Sunil Mushran wrote:
> 1. o2net talks tcp. It should be able to handle this.
> 2. If the cluster is active and the nodes are communicating,
> the keepalive packet is rarely sent. It only sends the packet
> if it does not hear from the other node for 5 secs.
> 3. Try the same with 1.2.3. (We made 2 important 1 line fixes.)
> 4. If this does happen again, and you are interested, we
> could always give you a drop that dumps the stack of
> all the procs, to get a better feel for the situation.
>
> Andy Phillips wrote:
> > Hello,
> >
> > Apologies for following up on myself.
> >
> > in ocfs2/cluster/tcp_internal.h
> > #define O2NET_KEEPALIVE_DELAY_SECS 5
> > #define O2NET_IDLE_TIMEOUT_SECS 10
> >
> > Is this really sensible? Potentially, given small variance in
> > system clocks losing one keepalive packet (assuming that
> > o2net_sc_send_keep_req is the only thing keeping the connection alive)
> > the loss of one packet could cause a node to self fence and reboot.
> >
> > Would
> > #define O2NET_KEEPALIVE_DELAY_SECS 5
> > #define O2NET_IDLE_TIMEOUT_SECS 20
> >
> > Cause any problems?
> >
> > Andy
> >
> > On Thu, 2006-08-03 at 12:41 +0100, Andy Phillips wrote:
> >
> >> Hello,
> >>
> >> I've a two node 10gR2 rac cluster on a pair of sun opteron boxes.
> >> Redhat AS 4.3 2.6.9-34.0.1.ELsmp x86_64. ocfs 1.2.2. RAC is using
> >> ASM to talk to the data files, but we have 3 ocfs2 filesystems up
> >> to share dba files, and the usual bits and bobs.
> >>
> >> Things were fine until, on mostly idle system, this happened out
> >> of the blue;
> >>
> >> Aug 2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
> >> 172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
> >> Aug 2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
> >> times that might help debug the situation: (tmr 1154545576.798263 now
> >> 1154545586.796978 dr 1154545576.798238 adv
> >> 1154545576.798291:1154545576.798293 func (06aac8a1:1)
> >> 1154545566.800782:1154545566.800787)
> >> Aug 2 19:06:27 fred kernel: o2net: no longer connected to node barney
> >> (num 0) at 172.16.6.10:7777
> >> Aug 2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
> >> fencing this node because it is connected to
> >> a half-quorum of 1 out of 2 nodes which doesn't include the lowest
> >> active node 0
> >> Aug 2 19:08:33 fred kernel: (25,7):o2hb_stop_all_regions:1908 ERROR:
> >> stopping heartbeat on all active regions.
> >>
> >> And the node then halted.
> >>
> >> Barney is node 0. The systems were idle. We've hammered the ocfs2
> >> file systems, and set o2cb_heartbeat_threshold to 61. All is good and
> >> stable under heavy i/o.
> >>
> >> The interconnect is a bonded interface, with two gig cards, each
> >> connected (with flow control on) to two separate FESX424 switches.
> >> The switches dont register any problems at this time, nor does linux
> >> register any interface issues.
> >>
> >> I'm looking at the source code at the moment, but nothing is leaping
> >> out at me. Any ideas - Do the timer debug lines above mean anything to
> >> anyone.
> >>
> >> Thanks
> >> Andy
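
As a small aid to reading the timer debug line quoted above, the sketch below subtracts the logged timestamps from one another. The values are copied from the log. The meanings attached to each field (tmr as the moment the idle timer was armed, now as the moment it fired, dr as the last data-ready event, and the second func value as the end of the last message handler) are assumptions and are not confirmed anywhere in this thread.

```python
from decimal import Decimal

# Timestamps copied from the o2net_idle_timer log line quoted above.
# The field meanings in the comments are assumptions, not confirmed facts.
stamps = {
    "tmr": Decimal("1154545576.798263"),        # assumed: idle timer armed
    "now": Decimal("1154545586.796978"),        # assumed: idle timer fired
    "dr": Decimal("1154545576.798238"),         # assumed: last data-ready event
    "func_stop": Decimal("1154545566.800787"),  # assumed: last handler finished
}


def age(field):
    """Seconds elapsed between the given timestamp and 'now'."""
    return stamps["now"] - stamps[field]


for field in ("tmr", "dr", "func_stop"):
    print(f"now - {field}: {age(field)} s")
```

Under those assumptions, the timer fired 9.998715 s after tmr, which is consistent with the "idle for 10 seconds" message, and the last func timestamp is 19.996191 s older than now.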
-- 
Andy Phillips, FRAS
Systems Architect, Information Systems.
Betfair.com
Direct Line: 0208 834 8436
Betfair Limited (Company No.5140986), Winslow Road, Hammersmith Embankment, London W6 9HP, United Kingdom, +44 208 834 8000, +44 208 834 8501 (direct).

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
