The behavior is not different: if you break the node1-node0 connection, node1 will self-reboot in the current design. It doesn't matter what exactly you unplug: the socket on one node, the socket on the other, or the inter-switch connection if one is used.
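
To make the rule concrete, here is a rough sketch of the decision as I read it from the o2quo_make_decision message quoted further down in this thread ("fencing this node because it is connected to a half-quorum of one of two nodes which doesn't include the lowest active node 0"). It is a simplified Python model, not the kernel code; the function name should_fence and its two set arguments are made up for illustration.

```python
def should_fence(heartbeating, connected):
    """Simplified model of the quorum decision (hypothetical helper).

    heartbeating: set of node numbers still seen alive on the disk heartbeat,
                  including the local node.
    connected:    set of node numbers the local node can reach over the
                  network, including itself.
    Returns True if the local node should fence (reboot) itself.
    """
    if not heartbeating:
        return False  # nothing is alive, so there is nothing to decide

    alive = len(heartbeating)
    reachable = len(connected & heartbeating)

    if alive % 2 == 1:
        # Odd number of live nodes: a strict majority must be reachable.
        return reachable < (alive + 1) // 2

    # Even number of live nodes: the cluster can split into equal halves.
    half = alive // 2
    if reachable < half:
        return True
    if reachable == half:
        # Tie-break: only the half holding the lowest live node survives.
        return min(heartbeating) not in connected
    return False


# Two nodes, interconnect broken (whichever cable was pulled):
print(should_fence({0, 1}, {0}))  # node 0's view
print(should_fence({0, 1}, {1}))  # node 1's view

# Three nodes, node 0 cut off from the network:
print(should_fence({0, 1, 2}, {0}))     # node 0's view
print(should_fence({0, 1, 2}, {1, 2}))  # node 1's view
```

In the two-node case both nodes see the same thing (one reachable node out of two), so the tie-break on the lowest node number decides and node 1 always loses. With an odd number of nodes there is no tie, and the node that is actually cut off is the one that fences.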
Add a third node and everything will change.

----- Original Message -----
From: "Sunil Mushran" <[EMAIL PROTECTED]>
To: "Alexei_Roudnev" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[email protected]>
Sent: Wednesday, November 15, 2006 11:03 AM
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

> You are missing his point. He is not saying that fencing is the problem.
> He is asking as to why the behavior differs between unplugging node 0
> and node 1.
>
> Alexei_Roudnev wrote:
> > It is not a bug; it is all by design.
> >
> > The problem is that OCFSv2:
> > - can't support more than 1 interconnection link, so you always risk
> >   losing the interconnection; in addition, to make things worse, it
> >   doesn't support a serial interconnection;
> > - can't find a quorum in a 2-node configuration (it's not an OCFSv2
> >   problem but a general concern with any 2-node cluster), so all nodes
> >   lose quorum if the network connection is lost;
> > - doesn't analyze FS activity, and reboots all nodes without quorum,
> >   except node0, when the network connection is lost.
> >
> > It can't be improved without supporting multiple interconnections plus
> > better decisions about fencing (there is no use in fencing a node if it
> > has no outstanding IO on the cluster file system).
> >
> > This is a well-known problem with OCFSv2. One solution is to add a 3rd
> > node and use interface bonding (be sure that the interface convergence
> > time is less than the o2cb timeout).
> >
> > ----- Original Message -----
> > From: <[EMAIL PROTECTED]>
> > To: "Sunil Mushran" <[EMAIL PROTECTED]>
> > Cc: <[email protected]>
> > Sent: Tuesday, November 14, 2006 10:35 PM
> > Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
> >
> >> I decided to rebuild this from scratch today and got the same result.
> >>
> >> Two cluster nodes; both boxes remain connected to the shared storage
> >> throughout the tests.
> >>
> >> I unplug the network connection from node0 and get e1000 driver
> >> "Tx Unit Hang" messages on the node0 console.
> >> The node1 console displays "o2net_idle_timer:1309 here are some times
> >> to help debug the situation" followed by additional output.
> >> node1 sits for a while and eventually displays "o2quo_make_decision:143
> >> error: fencing this node because it is connected to a half-quorum of one
> >> of two nodes which doesn't include the lowest active node 0".
> >> node 0 replays node 1's journal; too bad it still isn't on the network.
> >>
> >> This is in node 1's /var/log/messages after the reboot:
> >>
> >> Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net
> >> (num 0) at 10.xxx.0.45:7777 has been idle for 10 seconds, shutting it down.
> >> Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some
> >> times that might help debug the situation: (tmr 1163570146.656474 now
> >> 1163570156.655334 dr 1163570146.656446 adv
> >> 1163570146.656476:1163570146.656478 func (3a33f0f8:505)
> >> 1163570057.403947:1163570057.403950)
> >> Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node
> >> FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
> >>
> >> I'm confused by this. Shouldn't node 0 have eventually rebooted since it
> >> lost network connectivity, and node 1 replayed node 0's journal and kept
> >> going? As it is right now we are left with no IP-reachable box.
> >>
> >> If I do this same test but unplug node 1 instead of node 0, it works as it
> >> should. node 1 will fence and node 0 will replay the journal and stay
> >> online.
> >>
> >> Any input is greatly appreciated.
> >>
> >> Thanks,
> >>
> >> Colin Farley
> >> Network Administrator
> >> E-Care Contact Center Services
> >> Phone:(204) 940-6244
> >> Fax:(204) 940-7394
> >>
> >> From: Sunil Mushran <[EMAIL PROTECTED]>
> >> To: [EMAIL PROTECTED]
> >> Cc: [email protected]
> >> Sent: 11/13/2006 08:23 PM
> >> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
> >>
> >> Considering o2net only cares whether it is connected to the other node
> >> or not, it should not make a difference whether one unplugs node 0 or
> >> node 1. The result should be the same. Node 1 should fence in both cases.
> >>
> >> Do you see messages indicating that the node(s) have lost connectivity?
> >> If so, could you share them?
> >>
> >> It would be easiest if you could file a bug on oss.oracle.com/bugzilla
> >> with the messages file and a listing of the course of events... as in,
> >> unplugged cable on node 0 at time x, etc.
> >>
> >> [EMAIL PROTECTED] wrote:
> >>> I'm testing a 2 node cluster in a VMWare ESX environment for use as a
> >>> high availability FTP server to support a CRM application. Both nodes
> >>> run Unbreakable 2.0 x86_64. They access a 300GB OCFS2 volume on an RDM
> >>> LUN on an HP EVA. All disk connectivity is fine and I haven't seen any
> >>> problems there. The problem comes when doing some IP failover testing.
> >>> The IP failover is done using UCARP, so to test failover I tried
> >>> unplugging one node's virtual network cable to see what happens.
> >>>
> >>> If I unplug node 1 everything is fine: node 1 eventually panics and
> >>> reboots while node 0 chugs along fine. The problem comes when
> >>> unplugging node 0. When node 0 loses network connectivity it does not
> >>> panic, and eventually node 1 panics and reboots. Is there a reason why
> >>> the lower node does not panic if it loses network connectivity?
> >>>
> >>> Heartbeat thresholds are the same on each node at 31 and both nodes are
> >>> set to reboot on panic; node0 just never panics. All software installed
> >>> is the version that comes with Unbreakable 2.0.
> >>>
> >>> I didn't do the config on these boxes, so the first thing I'm going to
> >>> do on Tuesday when I work on this is rebuild both nodes from scratch,
> >>> but I figured I would ask first to see if it was an easy question for
> >>> someone on the list to answer.
> >>>
> >>> Thanks,
> >>>
> >>> Colin Farley
> >>> Network Administrator
> >>> E-Care Contact Center Services
> >>> Phone:(204) 940-6244
> >>> Fax:(204) 940-7394

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
