Hi Dejan,
Thank you for your reply.
Hi Satoshi-san,
On Tue, Sep 09, 2008 at 04:31:25PM +0900, OKADA Satoshi wrote:
Hi,
I got an unexpected ERROR message while testing Heartbeat process failure.
ha.cf:
-----
crm on
use_logd on
keepalive 1
deadtime 10
initdead 40
warntime 5
udpport 694
bcast eth0
node node01
node node02
watchdog /dev/watchdog
-----
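The timing directives above can be sanity-checked with a short script. This is only a sketch: the function names are hypothetical, and the two rules it applies (keepalive < warntime < deadtime, and initdead at least twice deadtime) are the usual rules of thumb for ha.cf, not checks that Heartbeat itself is known to enforce in this form.

```python
# Sketch only: read the timing directives from ha.cf text and apply
# two common rules of thumb. Helper names are hypothetical.
HA_CF = """\
crm on
use_logd on
keepalive 1
deadtime 10
initdead 40
warntime 5
udpport 694
bcast eth0
node node01
node node02
watchdog /dev/watchdog
"""

TIMING_KEYS = ("keepalive", "warntime", "deadtime", "initdead")


def parse_seconds(value):
    """Convert '10' or '500ms' to seconds as a float."""
    if value.endswith("ms"):
        return float(value[:-2]) / 1000.0
    return float(value)


def read_timings(text):
    """Return a dict of the timing directives found in the text."""
    timings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in TIMING_KEYS:
            timings[parts[0]] = parse_seconds(parts[1])
    return timings


def check_timings(t):
    """Return a list of warnings; assumes all four directives are present."""
    problems = []
    if not t["keepalive"] < t["warntime"] < t["deadtime"]:
        problems.append("expected keepalive < warntime < deadtime")
    if t["initdead"] < 2 * t["deadtime"]:
        problems.append("initdead should be at least twice deadtime")
    return problems


if __name__ == "__main__":
    print(check_timings(read_timings(HA_CF)))
```

With the values in this ha.cf, neither rule is violated, so the timing settings are unlikely to be related to the error.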
heartbeat version: 2.1.4
OS version: RHEL 5.1
The test procedure:
1. start heartbeat
# /etc/init.d/heartbeat start
2. kill a heartbeat process
# kill -9 <"heartbeat: write" or "heartbeat: read" process>
The killed processes are restarted.
3. stop heartbeat
# /etc/init.d/heartbeat stop
I get an ERROR message during this stop.
---- ha-log -----
heartbeat[4632]: 2008/09/09_14:43:41 ERROR: Watchdog write magic character failure: closing /dev/watchdog!: Bad file descriptor
heartbeat[4632]: 2008/09/09_14:43:41 ERROR: Watchdog close(2) failed.: Bad file descriptor
-----------------
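The two ERROR lines match the two steps of the Linux watchdog "magic close": writing the character V to the device, then closing it. Both steps fail with EBADF ("Bad file descriptor") when the descriptor is no longer open. A minimal sketch of that symptom follows, using a temporary file as a stand-in for /dev/watchdog. The helper name magic_close is hypothetical, and this is neither Heartbeat's code nor the attached patch; it does not show why the descriptor became invalid.

```python
import errno
import os
import tempfile

MAGIC_CLOSE = b"V"  # Linux watchdog "magic close" character


def magic_close(fd):
    """Write the magic character, then close fd.

    Returns a list of (step, errno) pairs for the steps that failed.
    """
    failures = []
    try:
        os.write(fd, MAGIC_CLOSE)
    except OSError as exc:
        failures.append(("write", exc.errno))
    try:
        os.close(fd)
    except OSError as exc:
        failures.append(("close", exc.errno))
    return failures


def demo():
    # A regular temporary file stands in for /dev/watchdog (assumption).
    fd, path = tempfile.mkstemp()
    try:
        first = magic_close(fd)   # descriptor is open: both steps succeed
        second = magic_close(fd)  # descriptor already closed: both fail
    finally:
        os.unlink(path)
    return first, second


if __name__ == "__main__":
    first, second = demo()
    print(first)
    print([(step, errno.errorcode[code]) for step, code in second])
```

The second call fails on both the write and the close, in the same order as the two messages in ha-log.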
I think this has the same cause as Bugzilla No. 1702, so I made a patch.
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1702
Please check the attached patch.
Sorry for the delay on this one.
Your patch looks fine to me. Did you test it?
Yes.
I tested several operations and checked the logs and resource
status using crm_mon. I did not find any problems.
---
Outline of the test:
Two nodes (Active-Standby)
watchdog directive in ha.cf
resources: rscGroup (IPaddr, pgsq, Filesystem)
1. I tested the behavior of Heartbeat when the target processes were not killed.
Target processes are "FIFO reader", "write bcast", "read bcast",
"write ping" and "read ping".
1-1 A resource fails, and fail-over occurs.
1-2 Ping communication fails, and fail-over occurs.
1-3 The master control process is killed, and the node is rebooted by the watchdog.
1-4 Heartbeat runs continuously for about one hour.
2. I tested the behavior of Heartbeat when the target processes were killed.
2-1 The target processes are killed and then restarted.
Afterwards, a resource fails, and fail-over occurs.
2-2 The "read ping" and "write ping" processes are killed.
Afterwards, ping communication fails, and fail-over occurs.
2-3 The target processes are killed and then restarted.
Afterwards, Heartbeat runs continuously for about one hour.
Best Regards,
OKADA Satoshi
NTT Open Source Software Center