Austin Rock wrote:
> Config files are already attach on this mail.  I am attaching the log files
> on this mail. Kindly look into it and let me know what is the problem.

I see you finally (after 5 days) included the logs we asked you for on
the first day.  As I said, the logs are ESSENTIAL.  They are NOT
optional.  You'll see what I mean below.

Did you look at your logs yourself?  If you had, you would have asked
different questions.  From the fact you apparently didn't look at your
logs, I would judge that you're not an experienced Linux system
administrator.  Honestly, HA is not a beginner topic for a SysAdmin.  If
you don't have experience in running machines and watching logs, then
this is probably not the place to learn those skills.  We're generally
nice people, but we can't teach you the basics of administering Linux
systems.  There are lots of other places to learn that.

Your logs are just _full_ of ERROR: messages.  The rule is -- if it says
ERROR, then that's bad.  99.5% of the time it's bad.

298 ERROR messages in the master log.

25 ERROR messages in the slave log.

Even one ERROR message is too many.  We don't use ERROR: for things that
might happen under normal circumstances.
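If you want to get counts like these for your own logs, here is a minimal sketch in Python.  It assumes the log line layout seen in the heartbeat excerpts quoted in this mail ("heartbeat[PID]: TIMESTAMP SEVERITY: text"); the function name and the regular expression are mine for illustration, and are not part of heartbeat.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpts in this mail:
#   heartbeat[15836]: 2007/04/02_02:20:11 ERROR: Cannot rexmit pkt ...
# Logs written through syslog may carry an extra prefix, which search() tolerates.
SEVERITY_RE = re.compile(r"heartbeat\[\d+\]: \S+ (CRIT|ERROR|WARN|info):")


def count_severities(lines):
    """Count heartbeat log lines per severity tag (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "heartbeat[15836]: 2007/04/02_02:20:11 ERROR: Cannot rexmit pkt 124 for haslave: seqno too low",
    "heartbeat[15836]: 2007/04/02_02:20:16 ERROR: Message hist queue is filling up (151 messages in queue)",
    "heartbeat[15836]: 2007/04/02_02:21:24 CRIT: Cluster node haslave returning after partition.",
    "heartbeat[15836]: 2007/04/02_02:21:28 WARN: Deadtime value may be too small.",
]
counts = count_severities(sample)
print(counts["ERROR"], counts["CRIT"], counts["WARN"])
```

To run it against a real log, pass an open file instead of the sample list, for example count_severities(open(path)) where path is wherever your ha.cf sends its log output.  Anything other than zero for ERROR and CRIT deserves a look.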

You were asking the wrong question.  The question isn't "why does the
master take so long to take resources back"?  The right question is:
"Why is my configuration totally broken"?  Or, "Why is my log just FULL
of serious-looking ERROR messages"?  And, even worse, there are CRIT messages.

So, let's look at a few of the error messages you got...

A few from hamaster:
heartbeat[15836]: 2007/04/02_02:20:11 ERROR: Cannot rexmit pkt 124 for
haslave: seqno too low
heartbeat[15836]: 2007/04/02_02:20:16 ERROR: Message hist queue is
filling up (151 messages in queue)
heartbeat[15836]: 2007/04/02_02:21:24 CRIT: Cluster node haslave
returning after partition.
heartbeat[15836]: 2007/04/02_02:21:24 info: For information on cluster
partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[15836]: 2007/04/02_02:21:28 WARN: Deadtime value may be too small.
heartbeat[15836]: 2007/04/02_02:21:28 info: See FAQ for information on
tuning deadtime.
heartbeat[15836]: 2007/04/02_02:21:28 info: URL:
http://linux-ha.org/FAQ#heavy_load


Did you follow these URLs and read them?



A few from haslave:
heartbeat[1987]: 2007/04/02_11:49:58 ERROR: Message hist queue is
filling up (151 messages in queue)
heartbeat[2956]: 2007/04/02_11:50:25 ERROR: Irretrievably lost packet:
node hamaster seq 124
heartbeat[1987]: 2007/04/02_11:49:49 CRIT: Cluster node hamaster
returning after partition.
heartbeat[1987]: 2007/04/02_11:49:49 info: For information on cluster
partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[1987]: 2007/04/02_11:49:49 WARN: Deadtime value may be too small.
heartbeat[1987]: 2007/04/02_11:49:49 info: See FAQ for information on
tuning deadtime.
heartbeat[1987]: 2007/04/02_11:49:49 info: URL:
http://linux-ha.org/FAQ#heavy_load


The first thing I notice is that the timestamps between the two machines
don't seem AT ALL similar.  They appear to differ by more than 9 hours.
When you're diagnosing problems, it's REALLY helpful to have the two
machines' clocks in sync.  We highly recommend using xntpd for this.
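As a rough way to put a number on the offset, here is a small Python sketch that subtracts two heartbeat timestamps.  It assumes the two "returning after partition" CRIT messages quoted above (02:21:24 on hamaster, 11:49:49 on haslave) describe the same event as seen from each side; that is my guess from the excerpts, not something the logs prove.  The helper names are hypothetical.

```python
from datetime import datetime

# Timestamp layout used in the heartbeat log lines quoted in this mail.
STAMP_FORMAT = "%Y/%m/%d_%H:%M:%S"


def parse_stamp(stamp):
    """Parse a heartbeat log timestamp such as '2007/04/02_02:21:24'."""
    return datetime.strptime(stamp, STAMP_FORMAT)


def clock_offset(stamp_a, stamp_b):
    """Return how far the second timestamp is ahead of the first."""
    return parse_stamp(stamp_b) - parse_stamp(stamp_a)


# CRIT "returning after partition" lines: hamaster first, haslave second.
offset = clock_offset("2007/04/02_02:21:24", "2007/04/02_11:49:49")
print(offset)  # 9:28:25
```

With the clocks in sync, the same comparison should come out at a few seconds at most, and lining up the two logs becomes much easier.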

The next thing I notice is that these messages all have to do with
lost packets and broken communication.

In 90+% of cases, the cause is that you have a firewall enabled on
your Linux machine.  Yes, you probably do.  And when
you come back and say you've turned it off, you probably didn't.  This
is how this particular problem usually turns out.  And, you need to make
sure it's off on both systems -- not just one.

So, turn off your firewalls (if you have any) and try again.

Here's how to tell if you succeeded in turning off your firewall:
sudo /usr/sbin/iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

If you run iptables, and your output looks like this, then your firewall
is off ON THAT MACHINE.
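If you would rather not compare the output by eye on each machine, the check can be sketched in Python as below.  This is only an illustration of the comparison described above: the function name is hypothetical, it looks only at the text that "iptables -L" prints (so only the default filter table), and it treats anything other than ACCEPT policies with no rules as "not off".

```python
def firewall_looks_off(iptables_output):
    """Return True if the 'iptables -L' text shows only chains with
    policy ACCEPT and no rule lines (hypothetical helper)."""
    saw_chain = False
    for raw_line in iptables_output.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank separator between chains
        if line.startswith("Chain "):
            saw_chain = True
            if "(policy ACCEPT)" not in line:
                return False  # DROP/REJECT policy, or a user-defined chain
        elif line.startswith("target"):
            continue  # column header line
        else:
            return False  # any other line is a rule
    return saw_chain


clean_output = """Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
"""
print(firewall_looks_off(clean_output))
```

Feed it the captured output of the iptables command on each node, and again after a reboot.  It has to report True on both machines.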

You also need to be sure that it doesn't re-enable itself after a reboot.

It is possible that this isn't the problem.  See my next paragraph...

Also, the right way to test a failure is NOT to unplug ethernet cables,
but to reboot machines.  You need to have redundant communication (which
was mentioned several times in the documentation you were asked to read).

Once you create a split-brain, you more or less get what you deserve.
According to the URL which you no doubt read:
        A split-brain condition is the result of a ClusterPartition,
        where each side believes the other is dead, and then
        proceeds to take over resources as though the other side
        no longer owned any resources.

        After this, a variety of BadThingsWillHappen - including
        destroying shared disk data.

Sounds like the web page was right.  Bad Things Happened.

We're going to try and get you to an initially working cluster, but
since you don't appear to be at all experienced in managing Linux
systems, it will likely be painful, and it won't replace all the
knowledge you appear to lack at this point.  I'm not trying to be
insulting, and if I've insulted you, please forgive me.  I'm just trying
to be realistic.  You really do need to know how to read logs, read
documentation, and manage a Linux system before you try and make one
highly-available.


-- 
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
