It is not a bug; it is all by design. The problem is that OCFS2:
- can't support more than one interconnect link, so you always risk losing the interconnect. To make things worse, it doesn't support a serial interconnect either;
- can't find a quorum in a 2-node configuration (this is not an OCFS2 problem but a general concern with any 2-node cluster), so all nodes lose quorum if the network connection is lost;
- doesn't analyze FS activity, and reboots every node without quorum, except node 0, when the network connection is lost.
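A minimal sketch of the tie-break behind the last two points, modeled on the o2quo message quoted further down ("connected to a half-quorum of one of two nodes which doesn't include the lowest active node 0"). This is a simplified Python model of that rule as I understand it, not the kernel code; the function name `should_fence` and its arguments are made up for illustration.

```python
def should_fence(heartbeating, connected, me):
    """Return True if node `me` should fence (reboot) itself.

    heartbeating: node numbers still heartbeating on the shared disk
    connected:    node numbers `me` can still reach over the network
    me:           this node's number
    (Hypothetical model: all names here are illustrative assumptions.)
    """
    alive = set(heartbeating) | {me}            # nodes that count for quorum
    reachable = (set(connected) | {me}) & alive  # a node always reaches itself
    lowest = min(alive)

    if len(alive) % 2 == 1:
        # Odd number of nodes: a strict majority is required.
        return len(reachable) < (len(alive) + 1) // 2

    # Even number of nodes: half is enough, but only for the half
    # that contains the lowest-numbered live node.
    half = len(alive) // 2
    if len(reachable) < half:
        return True
    return len(reachable) == half and lowest not in reachable


# Two nodes, network link cut (either cable): node 1 fences, node 0 stays up.
print(should_fence({0, 1}, set(), me=0))  # False
print(should_fence({0, 1}, set(), me=1))  # True
```

Under this model it does not matter which node has its cable pulled in a 2-node cluster: node 1 always loses the tie. With a third node, an isolated node fences itself and the remaining two keep a majority.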
It can't be improved without support for multiple interconnects plus better fencing decisions (there is no use in fencing a node if it has no outstanding IO on the cluster file system). This is a well-known problem with OCFS2. One solution is to add a third node and use interface bonding (be sure that the bonding convergence time is less than the o2cb timeout).

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: "Sunil Mushran" <[EMAIL PROTECTED]>
Cc: <[email protected]>
Sent: Tuesday, November 14, 2006 10:35 PM
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

> I decided to rebuild this from scratch today and got the same result.
>
> two cluster nodes, both boxes remain connected to the shared storage
> throughout tests.
>
> I unplug network connection from node0 and get e1000 driver "Tx Unit Hang"
> messages on node0 console
> node1 console displays "o2net_idle_timer:1309 here are some times to help
> debug the situation" followed by additional output
> node1 sits for a while and eventually displays "o2quo_make_decision:143
> error: fencing this node because it is connected to a half-quorum of one of
> two nodes which doesn't include the lowest active node 0"
> node 0 replays node 1's journal, too bad it still isn't on the network
>
> this is in node 1 /var/log/messages after reboot
>
> Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777 has been idle for 10 seconds, shutting it down.
> Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1163570146.656474 now 1163570156.655334 dr 1163570146.656446 adv 1163570146.656476:1163570146.656478 func (3a33f0f8:505) 1163570057.403947:1163570057.403950)
> Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
>
> I'm confused by this.
> Shouldn't node 0 have eventually rebooted since it lost network
> connectivity and node 1 replayed node 0's journal and kept going?
> As it is right now we are left with no IP reachable box.
>
> If I do this same test but unplug node 1 instead of node 0, it works as it
> should. node 1 will fence and node 0 will replay the journal and stay
> online.
>
> Any input is greatly appreciated.
>
> Thanks,
>
> Colin Farley
> Network Administrator
> E-Care Contact Center Services
> Phone:(204) 940-6244
> Fax:(204) 940-7394
>
>
> Sunil Mushran <[EMAIL PROTECTED]>
> 11/13/2006 08:23 PM
> To: [EMAIL PROTECTED]
> cc: [email protected]
> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
>
> Considering o2net only cares whether it is connected to the other node
> or not, it should not make a difference whether one unplugs node 0 or
> node 1. The result should be the same. Node 1 should fence in both cases.
>
> Do you see messages indicating that the node(s) have lost connectivity?
> If so, could you share them.
>
> It would be easiest if you could file a bug on oss.oracle.com/bugzilla with
> the messages file and listing the course of events... as in, unplugged
> cable on node 0 at time x, etc.
>
> [EMAIL PROTECTED] wrote:
> > I'm testing a 2 node cluster in a VMWare ESX environment for use as a
> > high availability FTP server to support a CRM application. Both nodes run
> > Unbreakable 2.0 x86_64. They access a 300GB OCFS2 volume on an RDM LUN
> > on an HP EVA. All disk connectivity is fine and haven't seen any problems
> > there. The problem comes when doing some IP failover testing. The IP
> > failover is done using UCARP so to test failover I tried unplugging one
> > node's virtual network cable to see what happens.
> >
> > If I unplug node 1 everything is fine, node 1 eventually panics and
> > reboots while node 0 chugs along fine. The problem comes when unplugging
> > node 0.
> > When node 0 loses network connectivity it does not panic and eventually
> > node 1 panics and reboots. Is there a reason why the lower node does not
> > panic if it loses network connectivity?
> >
> > Heartbeat thresholds are the same on each node at 31 and both nodes are
> > set to reboot on panic, node0 just never panics. All software installed
> > are versions that come with Unbreakable 2.0.
> >
> > I didn't do the config on these boxes so the first thing I'm going to do
> > on Tuesday when I work on this is rebuild both nodes from scratch but I
> > figured I would ask first to see if it was an easy question for someone
> > on the list to answer.
> >
> > Thanks,
> >
> > Colin Farley
> > Network Administrator
> > E-Care Contact Center Services
> > Phone:(204) 940-6244
> > Fax:(204) 940-7394
> >
> > _______________________________________________
> > Ocfs2-users mailing list
> > [email protected]
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
