I am using Heartbeat 2.0.7 and DRBD 0.7.21 on a CentOS 4.4 system.

I had two problems happen simultaneously. First, and most distressingly, my
cluster tried to fail over for no reason. Second, it failed to successfully
come up on the secondary node.

My primary goal with this cluster is that the clustering software itself
should not increase the failure rate. A failover is an insanely expensive
event. (My bias is heavily towards manual intervention to trigger the
failover, if a choice is to be made.)

When the failure happened, this appeared in /var/log/ha-debug on the
primary node. Full logs are attached.

(I had to do some anonymization of hostnames and IP addresses)

lrmd[3999]: 2007/07/20_19:00:41 WARN: on_op_timeout_expired: TIMEOUT:
operation monitor[10] on ocf::IPaddr::CLUSTER6_ip for client 4002, its
parameters: CRM_meta_interval=[5000] ip=[10.10.86.160]
CRM_meta_op_target_rc=[7] CRM_meta_id=[CLUSTER6_ip_mon]
CRM_meta_timeout=[5000] crm_feature_set=[1.0.6] CRM_meta_name=[monitor] .
crmd[4002]: 2007/07/20_19:00:41 ERROR: process_lrm_event:lrm.c LRM
operation (10) monitor_5000 on CLUSTER6_ip Timed Out
(client: 4002, call:64): 0.36.13763 -> 0.36.13764 (ok)

The system was under virtually no load at the time.

Am I correct in assuming that this is the timeout that expired:
            <op id="CLUSTER6_ip_mon" interval="5s" name="monitor"
timeout="5s"/>

If so, am I being sensible in disabling it completely, and perhaps all the
other monitor directives as well? I am not willing to bet a significant
amount of money that e.g. "ping -c 1 -q -n localhost" will complete in a
finite time.
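
Alternatively, rather than deleting the op outright, would it be saner to
just give the monitor far more headroom than its interval? Something like
this (the 10s/60s values are illustrative guesses on my part, not tested):

```xml
<!-- Hypothetical replacement: same op, but with a timeout well above
     the interval, so a transient stall can't trip it by itself. -->
<op id="CLUSTER6_ip_mon" interval="10s" name="monitor" timeout="60s"/>
```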

Also, on the way back up, one of the filesystems was busy and could not be
unmounted. But just after the processes were SIGKILLed, the timeout ran
out. I'm not sure which timeout this is, or if increasing it would help.

Filesystem[11709][11744]: 2007/07/20_19:00:44 INFO: Some processes on /apps were signalled
lrmd[3999]: 2007/07/20_19:00:45 info: RA output: (Filesystem_apps:stop:stderr) umount: /apps: device is busy
umount: /apps: device is busy

Filesystem[11709][11747]: 2007/07/20_19:00:45 ERROR: Couldn't unmount /apps; trying cleanup with SIGKILL
lrmd[3999]: 2007/07/20_19:00:45 info: RA output: (Filesystem_apps:stop:stdout) /apps:               17922c

Filesystem[11709][11749]: 2007/07/20_19:00:45 INFO: Some processes on /apps were signalled
tengine[5328]: 2007/07/20_19:00:46 WARN: global_timer_callback:callbacks.c Timer popped (abort_level=0, complete=false)
tengine[5328]: 2007/07/20_19:00:46 info: unconfirmed_actions:callbacks.c Waiting on 1 unconfirmed actions
tengine[5328]: 2007/07/20_19:00:46 WARN: global_timer_callback:callbacks.c Transition abort timeout reached... marking transition complete.
tengine[5328]: 2007/07/20_19:00:46 info: update_abort_priority:utils.c Abort priority upgraded to 1000000
tengine[5328]: 2007/07/20_19:00:46 info: update_abort_priority:utils.c Abort action 0 superceeded by 2
tengine[5328]: 2007/07/20_19:00:46 WARN: global_timer_callback:callbacks.c Writing 1 unconfirmed actions to the CIB
tengine[5328]: 2007/07/20_19:00:46 WARN: cib_action_update:actions.c rsc_op 13: Filesystem_apps_stop_0 on CLUSTER6n1 timed out
crmd[4002]: 2007/07/20_19:00:46 info: do_state_transition:fsa.c CLUSTER6n1: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
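
Incidentally, that "/apps:               17922c" line looks like fuser
output (PID 17922, with 'c' meaning the process's current directory is on
/apps). When I hit this by hand, I'd want to see the holders before
anything escalates to SIGKILL. Here is a rough POSIX-sh sketch of what
"fuser -m" computes, walking /proc directly — find_holders is my own
hypothetical helper, Linux-only, not anything the resource agent actually
ships:

```shell
# find_holders MNT: print PIDs whose cwd, root, or any open fd lives
# under MNT. A rough, Linux-only sketch of what `fuser -m MNT` reports
# (hypothetical helper; unreadable /proc entries are silently skipped).
find_holders() {
  mnt="$1"
  for p in /proc/[0-9]*; do
    pid="${p#/proc/}"
    for link in "$p/cwd" "$p/root" "$p"/fd/*; do
      target=$(readlink "$link" 2>/dev/null) || continue
      case "$target" in
        "$mnt"|"$mnt"/*) echo "$pid"; break ;;
      esac
    done
  done | sort -un
}

# e.g. find_holders /apps                         # who is pinning the mount?
#      find_holders /apps | xargs -r kill -TERM   # ask nicely first
```

"fuser -vm /apps" gives the same information with less typing, of course;
the sketch is just to make explicit what "busy" means here.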

Finally, the cluster was left in a half-working state. Is STONITH the only
way I can prevent this from happening? Once it started a failover, I would
have been happy if both nodes had rebooted rather than being left in a
clearly half-broken state.

I greatly appreciate any help or insight.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
