Again, read his email.
Alexei_Roudnev wrote:
Behavior is not different: if you break the node1-node0 connection, node1
will self-reboot in the current design.
It doesn't matter what exactly you unplug: the socket on node1, the socket
on node2, or the inter-switch connection if one is used.
Add node-3 and everything will change.
----- Original Message -----
From: "Sunil Mushran" <[EMAIL PROTECTED]>
To: "Alexei_Roudnev" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[email protected]>
Sent: Wednesday, November 15, 2006 11:03 AM
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
You are missing his point. He is not saying that fencing is the problem.
He is asking as to why the behavior differs between unplugging node 0
and node 1.
Alexei_Roudnev wrote:
It is not a bug; it is all by design.
Problem is that OCFSv2:
- can't support more than 1 interconnect link, so you always risk losing
  the interconnect. In addition, to make things worse, it doesn't support
  a serial interconnect;
- can't find a quorum in a 2-node configuration (it's not an OCFSv2
  problem but a general concern with any 2-node cluster), so all nodes
  lose quorum if the network connection is lost;
- doesn't analyze FS activity and reboots all nodes without quorum,
  except node0, when the network connection is lost.
It can't be improved without supporting multiple interconnects plus better
decisions about fencing (there is no use in fencing a node if it has no
outstanding IO on the cluster file system).
Well-known problem with OCFSv2. One solution is to add a 3rd node and use
interface bonding (be sure that the interface convergence time is less
than the o2cb timeout).
----- Original Message -----
From: <[EMAIL PROTECTED]>
To: "Sunil Mushran" <[EMAIL PROTECTED]>
Cc: <[email protected]>
Sent: Tuesday, November 14, 2006 10:35 PM
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
I decided to rebuild this from scratch today and got the same result.
Two cluster nodes; both boxes remain connected to the shared storage
throughout the tests.
I unplug the network connection from node0 and get e1000 driver "Tx Unit
Hang" messages on the node0 console.
The node1 console displays "o2net_idle_timer:1309 here are some times to
help debug the situation" followed by additional output.
node1 sits for a while and eventually displays "o2quo_make_decision:143
error: fencing this node because it is connected to a half-quorum of one
of two nodes which doesn't include the lowest active node 0".
node 0 replays node 1's journal; too bad it still isn't on the network.
This is in node 1's /var/log/messages after reboot:
Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777 has been idle for 10 seconds, shutting it down.
Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1163570146.656474 now 1163570156.655334 dr 1163570146.656446 adv 1163570146.656476:1163570146.656478 func (3a33f0f8:505) 1163570057.403947:1163570057.403950)
Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
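The timer values in that log line can be checked with a little arithmetic. This is only a sketch: it assumes `tmr` is the time the idle timer was armed and `now` is the time it fired, which the thread does not actually define, and the variable names are illustrative.

```python
from datetime import datetime, timezone
from decimal import Decimal

# Values taken from the o2net_idle_timer log line (seconds since the epoch).
tmr = Decimal("1163570146.656474")
now = Decimal("1163570156.655334")

# Assumption: tmr = idle timer armed, now = idle timer fired.
idle_seconds = now - tmr
print(idle_seconds)  # 9.998860

# Convert the firing time to a UTC wall-clock time for comparison with syslog.
fired_utc = datetime.fromtimestamp(int(now), tz=timezone.utc)
print(fired_utc.isoformat())  # 2006-11-15T05:55:56+00:00
```

The gap of just under 10 seconds lines up with the "has been idle for 10 seconds" message, and 05:55:56 UTC matches the 23:55:56 syslog stamp if the box logs in a UTC-6 time zone (an assumption).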
I'm confused by this. Shouldn't node 0 have eventually rebooted, since it
lost network connectivity, with node 1 replaying node 0's journal and
carrying on? As it is right now we are left with no IP-reachable box.
If I do this same test but unplug node 1 instead of node 0, it works as it
should: node 1 will fence and node 0 will replay the journal and stay
online.
Any input is greatly appreciated.
Thanks,
Colin Farley
Network Administrator
E-Care Contact Center Services
Phone:(204) 940-6244
Fax:(204) 940-7394
From: Sunil Mushran <[EMAIL PROTECTED]acle.com>
To: [EMAIL PROTECTED]
Cc: [email protected]
Sent: 11/13/2006 08:23 PM
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
Considering o2net only cares whether it is connected to the other node or
not, it should not make a difference whether one unplugs node 0 or node 1.
The result should be the same. Node 1 should fence in both cases.
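The tie-break behind this expectation can be sketched in a few lines of Python. This is a minimal sketch, not the o2quo source: it assumes the rule implied by the o2quo_make_decision message quoted elsewhere in this thread, namely that a node keeps running if it can reach a majority of the heartbeating nodes, or exactly half of them including the lowest-numbered one. The function name `should_fence` and its arguments are hypothetical.

```python
def should_fence(self_node, heartbeating, connected):
    """Return True if self_node should fence itself (assumed rule, see lead-in).

    heartbeating: node numbers still heartbeating on the shared disk
    connected:    node numbers self_node can reach over the network
    """
    group = set(connected) | {self_node}   # a node always counts itself
    total = len(heartbeating)
    lowest = min(heartbeating)

    if 2 * len(group) > total:
        return False                       # clear majority: keep running
    if 2 * len(group) == total and lowest in group:
        return False                       # even split, we hold the lowest node
    return True                            # minority, or losing half of a split


# Two-node cluster, network cut: each node can reach only itself.
print(should_fence(0, {0, 1}, {0}))        # False
print(should_fence(1, {0, 1}, {1}))        # True

# Three-node cluster, node 0 isolated from nodes 1 and 2.
print(should_fence(0, {0, 1, 2}, {0}))     # True
print(should_fence(1, {0, 1, 2}, {1, 2}))  # False
```

Under this assumed rule node 1 fences whichever cable is pulled in a two-node cluster, while in a three-node cluster the isolated node lands in the minority and fences itself.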
Do you see messages indicating that the node(s) have lost connectivity?
If so, could you share them?
It would be easiest if you could file a bug on oss.oracle.com/bugzilla
with the messages file and a listing of the course of events... as in,
unplugged cable on node 0 at time x, etc.
[EMAIL PROTECTED] wrote:
I'm testing a 2-node cluster in a VMware ESX environment for use as a
high-availability FTP server to support a CRM application. Both nodes run
Unbreakable 2.0 x86_64. They access a 300GB OCFS2 volume on an RDM LUN on
an HP EVA. All disk connectivity is fine and I haven't seen any problems
there. The problem comes when doing some IP failover testing. The IP
failover is done using UCARP, so to test failover I tried unplugging one
node's virtual network cable to see what happens.
If I unplug node 1 everything is fine: node 1 eventually panics and
reboots while node 0 chugs along fine. The problem comes when unplugging
node 0. When node 0 loses network connectivity it does not panic, and
eventually node 1 panics and reboots. Is there a reason why the lower node
does not panic if it loses network connectivity?
Heartbeat thresholds are the same on each node at 31 and both nodes are
set to reboot on panic; node0 just never panics. All software installed is
the version that comes with Unbreakable 2.0.
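For reference, a heartbeat threshold of 31 can be turned into a disk heartbeat timeout as sketched below. This assumes the relation commonly given in OCFS2 documentation, timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, with one heartbeat every 2 seconds; it should be checked against the version actually installed. The function name is hypothetical.

```python
HEARTBEAT_INTERVAL_S = 2  # assumed disk heartbeat interval, in seconds


def disk_heartbeat_timeout_s(threshold: int) -> int:
    """Seconds without a disk heartbeat before a node is declared dead.

    Assumed relation: (threshold - 1) * heartbeat interval.
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


print(disk_heartbeat_timeout_s(31))  # 60
```

Note that this is the disk heartbeat timeout. It is separate from the 10-second network idle timeout that shows up in the o2net log messages elsewhere in this thread.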
I didn't do the config on these boxes, so the first thing I'm going to do
on Tuesday when I work on this is rebuild both nodes from scratch, but I
figured I would ask first to see if it was an easy question for someone on
the list to answer.
Thanks,
Colin Farley
Network Administrator
E-Care Contact Center Services
Phone:(204) 940-6244
Fax:(204) 940-7394
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users