Hi Dejan,

Thank you for your reply.

Hi Satoshi-san,

On Tue, Sep 09, 2008 at 04:31:25PM +0900, OKADA Satoshi wrote:
Hi,

I got an unexpected ERROR message when I tested Heartbeat process failure.

ha.cf:
-----
crm on
use_logd on
keepalive 1
deadtime 10
initdead 40
warntime 5
udpport 694
bcast eth0
node node01
node node02
watchdog /dev/watchdog
-----

heartbeat version: 2.1.4
OS version: RHEL 5.1

The test procedure:
1. start heartbeat
# /etc/init.d/heartbeat start

2. kill heartbeat process
# kill -9 <"heartbeat: write" or "heartbeat: read" process>
These processes are restarted.

3. stop heartbeat
# /etc/init.d/heartbeat stop
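
Step 2 above can be scripted. The following is only a minimal sketch: the helper name heartbeat_io_pids is hypothetical, and it assumes the child processes show up in the ps listing with titles containing "heartbeat: write" or "heartbeat: read", as described in the procedure.

```shell
#!/bin/sh
# Hypothetical helper (not part of Heartbeat): reads "<pid> <command line>"
# lines on stdin, e.g. from `ps -eo pid,args`, and prints the PIDs of the
# processes whose title contains "heartbeat: write" or "heartbeat: read".
heartbeat_io_pids() {
    awk '/heartbeat: (write|read)/ { print $1 }'
}

# Possible use on a test node (destructive, it kills the selected processes):
#   ps -eo pid,args | heartbeat_io_pids | xargs kill -9
```

The pattern does not match the awk command line itself, because there the text after "heartbeat: " is an opening parenthesis rather than "write" or "read".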

I get the following ERROR messages during this stop step.
---- ha-log -----
heartbeat[4632]: 2008/09/09_14:43:41 ERROR: Watchdog write magic character failure: closing /dev/watchdog!: Bad file descriptor
heartbeat[4632]: 2008/09/09_14:43:41 ERROR: Watchdog close(2) failed.: Bad file descriptor
-----------------
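
To check for these messages after each stop, something like the sketch below could be used. The helper name count_watchdog_errors is hypothetical, and the log path in the usage comment is an assumption that depends on the local logd configuration.

```shell
#!/bin/sh
# Hypothetical helper: counts log lines on stdin that contain
# "ERROR: Watchdog" and prints the count (0 when there are none).
count_watchdog_errors() {
    awk '/ERROR: Watchdog/ { n++ } END { print n + 0 }'
}

# Possible use (the log path is an assumption):
#   count_watchdog_errors < /var/log/ha-log
```

A count of 0 after the stop step would suggest that the watchdog device was closed without these errors.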

I think this has the same cause as Bugzilla No. 1702, so I made a patch.
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1702

Please check the attached patch.

Sorry for the delay on this one.

Your patch looks fine to me. Did you test it?


Yes.

I tested several operations and checked the logs and the resource
status using crm_mon. I could not find any problems.


---
Outline of the test:
 Two nodes (Active-Standby)
 watchdog directive in ha.cf
 resources: rscGroup (IPaddr, pgsql, Filesystem)

  1. I tested the behavior of Heartbeat when the target processes did not go down.
    The target processes are "FIFO reader", "write bcast", "read bcast",
    "write ping" and "read ping".
    1-1 A resource fails, and fail-over occurs.
    1-2 Ping communication fails, and fail-over occurs.
    1-3 The master control process is killed, and the node is rebooted by the watchdog.
    1-4 Heartbeat runs continuously for about one hour.

  2. I tested the behavior of Heartbeat when the target processes went down.
    2-1 The target processes are killed and then restarted.
        Afterwards, a resource fails, and fail-over occurs.
    2-2 The "read ping" and "write ping" processes are killed.
        Afterwards, ping communication fails, and fail-over occurs.
    2-3 The target processes are killed and then restarted.
        Afterwards, Heartbeat runs continuously for about one hour.



Best Regards,

OKADA Satoshi
NTT Open Source Software Center
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
