Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

Sunil Mushran Wed, 15 Nov 2006 11:12:47 -0800

You are missing his point. He is not saying that fencing is the problem.
He is asking why the behavior differs between unplugging node 0
and node 1.


Alexei_Roudnev wrote:

It is not a bug; it is all by design.

The problem is that OCFSv2:
- can't support more than one interconnect link, so you always risk losing
the interconnect; in addition, to make things worse, it doesn't support a
serial interconnect;
- can't find a quorum in a 2-node configuration (this is not an OCFSv2
problem but a general concern with any 2-node cluster), so all nodes lose
quorum if the network connection is lost;
- doesn't analyze FS activity, and reboots all nodes without quorum, except
node 0, when the network connection is lost.

It can't be improved without support for multiple interconnects plus better
decisions about fencing (there is no point in fencing a node if it has no
outstanding IO on the cluster file system).

This is a well-known problem with OCFSv2. One solution is to add a third node
and use interface bonding (be sure that the interface convergence time is
less than the o2cb timeout).

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: "Sunil Mushran" <[EMAIL PROTECTED]>
Cc: <[email protected]>
Sent: Tuesday, November 14, 2006 10:35 PM
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

I decided to rebuild this from scratch today and got the same result.

Two cluster nodes; both boxes remain connected to the shared storage
throughout the tests.

I unplug network connection from node0 and get e1000 driver "Tx Unit Hang"
messages on node0 console
node1 console displays "o2net_idle_timer:1309 here are some times to help
debug the situation" followed by additional output
node1 sits for a while and eventually displays "o2quo_make_decision:143
error: fencing this node because it is connected to a half-quorum of one of
two nodes which doesn't include the lowest active node 0"
node 0 replays node 1's journal, too bad it still isn't on the network
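
The fencing message quoted above can be read as the result of a simple rule. Below is a minimal Python sketch of that rule, not the kernel code itself. It assumes that each node counts itself as connected, that both nodes are still seen on the disk heartbeat (the shared storage stayed attached), and that in an even split the half without the lowest-numbered heartbeating node fences. The name `should_fence` and its arguments are illustrative only.

```python
def should_fence(heartbeating, connected):
    """Return True if this node should fence itself.

    heartbeating: set of node numbers currently seen on the disk heartbeat.
    connected:    set of node numbers this node can reach over the network,
                  including itself (an assumption of this sketch).
    """
    if not heartbeating:
        return False

    lowest = min(heartbeating)
    lowest_reachable = lowest in connected
    total = len(heartbeating)
    reachable = len(connected & heartbeating)

    if total % 2 == 1:
        # Odd-sized cluster: fence unless we can reach a strict majority.
        return reachable < (total + 1) // 2

    half = total // 2
    if reachable < half:
        return True
    # Even split: only the half holding the lowest node number survives.
    return reachable == half and not lowest_reachable


# Two-node cluster with the interconnect down: each node reaches only itself.
print(should_fence({0, 1}, {1}))  # node 1's view: True, it fences
print(should_fence({0, 1}, {0}))  # node 0's view: False, it stays up
```

Under these assumptions the rule never looks at which cable was pulled, so node 1 fences whether node 0 or node 1 is the one that lost its link.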

this is in node 1 /var/log/messages after reboot

Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777 has been idle for 10 seconds, shutting it down.
Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1163570146.656474 now 1163570156.655334 dr 1163570146.656446 adv 1163570146.656476:1163570146.656478 func (3a33f0f8:505) 1163570057.403947:1163570057.403950)
Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
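
As a cross-check on the "idle for 10 seconds" message, the gap between the `tmr` and `now` stamps in the o2net_idle_timer line can be computed directly. The sketch below assumes `tmr` is the time the idle timer was last armed and `now` is the time it fired, and that the fractional part is a six-digit microsecond value, as it is in these lines. The helper name `idle_seconds` is made up for illustration.

```python
import re
from decimal import Decimal

# Only the two fields used here are copied from the log line above.
LINE = "(tmr 1163570146.656474 now 1163570156.655334)"


def idle_seconds(line):
    """Return now - tmr in seconds, parsed from an o2net_idle_timer line."""
    tmr = Decimal(re.search(r"\btmr (\d+\.\d+)", line).group(1))
    now = Decimal(re.search(r"\bnow (\d+\.\d+)", line).group(1))
    return now - tmr


print(idle_seconds(LINE))  # 9.998860
```

The result is just under 10 seconds, which is consistent with the idle timeout reported in the first log line.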

I'm confused by this.  Shouldn't node 0 have eventually rebooted since it
lost network connectivity and node 1 replayed node 0's journal and kept
going?  As it is right now we are left with no IP reachable box.

If I do this same test but unplug node 1 instead of node 0, it works as it
should. node 1 will fence and node 0 will replay the journal and stay
online.

Any input is greatly appreciated.

Thanks,

Colin Farley
Network Administrator
E-Care Contact Center Services
Phone:(204) 940-6244
Fax:(204) 940-7394



From: Sunil Mushran <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [email protected]
Date: 11/13/2006 08:23 PM
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem





Considering o2net only cares whether it is connected to the other node
or not, it should not make a difference whether one unplugs node 0 or
node 1. The result should be the same. Node 1 should fence in both cases.

Do you see messages indicating that the node(s) have lost connectivity?
If so, could you share them?

It would be easiest if you could file a bug on oss.oracle.com/bugzilla with
the messages file and listing the course of events... as in, unplugged
cable on node 0 at time x, etc.

[EMAIL PROTECTED] wrote:

I'm testing a 2-node cluster in a VMware ESX environment for use as a high
availability FTP server to support a CRM application.  Both nodes run
Unbreakable 2.0 x86_64.  They access a 300GB OCFS2 volume on an RDM LUN on
an HP EVA.  All disk connectivity is fine and I haven't seen any problems
there.  The problem comes when doing some IP failover testing.  The IP
failover is done using UCARP, so to test failover I tried unplugging one
node's virtual network cable to see what happens.

If I unplug node 1 everything is fine, node 1 eventually panics and reboots
while node 0 chugs along fine.  The problem comes when unplugging node 0.
When node 0 loses network connectivity it does not panic and eventually
node 1 panics and reboots.  Is there a reason why the lower node does not
panic if it loses network connectivity?

Heartbeat thresholds are the same on each node at 31 and both nodes are set
to reboot on panic; node 0 just never panics.  All software installed is
the version that comes with Unbreakable 2.0.
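
For reference, a rough sketch of what a heartbeat threshold of 31 works out to in seconds. It assumes the commonly documented o2cb rule that a node is declared dead after (threshold - 1) missed disk heartbeats at a 2-second interval. Both the interval constant and the helper name are assumptions of this sketch, not values taken from these nodes' configuration.

```python
HEARTBEAT_INTERVAL_SECS = 2  # assumed o2cb disk heartbeat interval


def heartbeat_dead_time_secs(threshold):
    """Seconds without a disk heartbeat before a node is considered dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


print(heartbeat_dead_time_secs(31))  # 60
```

This is the disk heartbeat timeout. The 10-second figure in the o2net log lines elsewhere in this thread is a separate network idle timeout.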

I didn't do the config on these boxes so the first thing I'm going to do on
Tuesday when I work on this is rebuild both nodes from scratch but I
figured I would ask first to see if it was an easy question for someone on
the list to answer.

Thanks,

Colin Farley
Network Administrator
E-Care Contact Center Services
Phone:(204) 940-6244
Fax:(204) 940-7394


_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users


