Austin Rock wrote:
> Config files are already attach on this mail. I am attaching the log files
> on this mail. Kindly look into it and let me know what is the problem.

I see you finally (after 5 days) included the logs we asked you for on the first day. As I said, the logs are ESSENTIAL. They are NOT optional. You'll see what I mean below.

Did you look at your logs yourself? If you had, you would have asked different questions. From the fact that you apparently didn't look at your logs, I would judge that you're not an experienced Linux system administrator.

Honestly, HA is not a beginner topic for a SysAdmin. If you don't have experience in running machines and watching logs, then this is probably not the place to learn those skills. We're generally nice people, but we can't teach you the basics of administering Linux systems. There are lots of other places to learn that.

Your logs are just _full_ of ERROR: messages. The rule is -- if it says ERROR, then that's bad. 99.5% of the time it's bad.

    298 ERROR messages in the master log.
    25 ERROR messages in the slave log.

Even one ERROR message is too many. We don't use ERROR: for things that might happen under normal circumstances.

You were asking the wrong question. The question isn't "Why does the master take so long to take resources back?" The right question is "Why is my configuration totally broken?" Or "Why is my log just FULL of serious-looking ERROR messages?" And, even worse, there are CRIT messages.

So, let's look at a few of the error messages you got...

A few from hamaster:

heartbeat[15836]: 2007/04/02_02:20:11 ERROR: Cannot rexmit pkt 124 for haslave: seqno too low
heartbeat[15836]: 2007/04/02_02:20:16 ERROR: Message hist queue is filling up (151 messages in queue)
heartbeat[15836]: 2007/04/02_02:21:24 CRIT: Cluster node haslave returning after partition.
heartbeat[15836]: 2007/04/02_02:21:24 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[15836]: 2007/04/02_02:21:28 WARN: Deadtime value may be too small.
heartbeat[15836]: 2007/04/02_02:21:28 info: See FAQ for information on tuning deadtime.
heartbeat[15836]: 2007/04/02_02:21:28 info: URL: http://linux-ha.org/FAQ#heavy_load

Did you follow these URLs and read them?

A few from haslave:

heartbeat[1987]: 2007/04/02_11:49:58 ERROR: Message hist queue is filling up (151 messages in queue)
heartbeat[2956]: 2007/04/02_11:50:25 ERROR: Irretrievably lost packet: node hamaster seq 124
heartbeat[1987]: 2007/04/02_11:49:49 CRIT: Cluster node hamaster returning after partition.
heartbeat[1987]: 2007/04/02_11:49:49 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[1987]: 2007/04/02_11:49:49 WARN: Deadtime value may be too small.
heartbeat[1987]: 2007/04/02_11:49:49 info: See FAQ for information on tuning deadtime.
heartbeat[1987]: 2007/04/02_11:49:49 info: URL: http://linux-ha.org/FAQ#heavy_load

The first thing I notice is that the timestamps on the two machines don't seem AT ALL similar. They appear to differ by more than 9 hours. When you're diagnosing problems, it's REALLY helpful to have the two machines' clocks in sync. We highly recommend using xntpd for this.

The next thing I notice is that these messages all have to do with lost packets and broken communication. In 90+% of cases, the cause of this is a firewall enabled on your Linux machine.

Yes, you probably do have one. And when you come back and say you've turned it off, you probably didn't. This is how this particular problem usually turns out. And you need to make sure it's off on both systems -- not just one.

So, turn off your firewalls (if you have any) and try again.

Here's how to tell if you succeeded in turning off your firewall:

    sudo /usr/sbin/iptables -L
    Chain INPUT (policy ACCEPT)
    target     prot opt source               destination

    Chain FORWARD (policy ACCEPT)
    target     prot opt source               destination

    Chain OUTPUT (policy ACCEPT)
    target     prot opt source               destination

If you run iptables, and your output looks like this, then your firewall is off ON THAT MACHINE.
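
The same check can be scripted so it is easy to repeat on both nodes. What follows is only a sketch under stated assumptions: the function name firewall_is_off is hypothetical, and it assumes the plain "iptables -L" (or "-L -n") output format shown above, without the extra columns that -v adds. It treats any non-ACCEPT policy, any user-defined chain, and any rule line as "firewall still on", and it only sees the default filter table.

```shell
#!/bin/sh
# Hypothetical helper: reads `iptables -L` output on stdin.
# Returns 0 if every chain has policy ACCEPT and holds no rules,
# 1 otherwise (including when no chains were seen at all).
firewall_is_off() {
    awk '
        /^Chain / {
            chains++
            # A DROP or REJECT policy, or a user-defined chain
            # ("N references"), counts as firewall still on.
            if ($0 !~ /\(policy ACCEPT/) bad = 1
            next
        }
        /^target /  { next }   # column header line
        /^[ \t]*$/  { next }   # blank separator line
        { bad = 1 }            # anything else is a rule
        END { if (bad || chains == 0) exit 1 }
    '
}

# Example invocation (needs root; run it on BOTH nodes):
#   sudo /usr/sbin/iptables -L -n | firewall_is_off \
#       && echo "firewall is off" || echo "firewall still active"
```

Reading the listing from stdin rather than calling iptables inside the function keeps the check testable against saved output, for example a listing captured on the other node.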
You also need to be sure that it doesn't re-enable itself after a reboot.

It is possible that this isn't the problem. See my next paragraph...

Also, the right way to test a failure is NOT to unplug ethernet cables, but to reboot machines. You need to have redundant communication (which was mentioned several times in the documentation you were asked to read).

Once you create a split-brain, you more or less get what you deserve. According to the URL which you no doubt read:

    A split-brain condition is the result of a ClusterPartition, where
    each side believes the other is dead, and then proceeds to take over
    resources as though the other side no longer owned any resources.
    After this, a variety of BadThingsWillHappen - including destroying
    shared disk data.

Sounds like the web page was right. Bad Things Happened.

We're going to try and get you to an initially working cluster, but since you don't appear to be at all experienced in managing Linux systems, it will likely be painful, and it won't replace all the knowledge you appear to lack at this point.

I'm not trying to be insulting, and if I've insulted you, please forgive me. I'm just trying to be realistic. You really do need to know how to read logs, read documentation, and manage a Linux system before you try and make one highly-available.

-- 
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems