> The root cause seems to be that heartbeat is not providing client
> status messages (to say that the crmd processes are active) once the
> split-brain heals.
>
> crmd[1350]: 2007/11/08_10:38:43 info: join_make_offer: Peer process on
> dl380g5c is not active (yet?)
> crmd[1350]: 2007/11/08_10:40:11 WARN: do_state_transition: Only 1 of 2
> cluster nodes are eligible to run resources - continue 0
>
> Because of this, the crm doesn't consider dl380g5c online and the PE
> can't shut it down.
>
>
> I think you need to file a bug for alan about this.
I found a similar case.
While recovering from a split brain, one node was never able to rejoin
the membership:

crmd[6657]: 2007/11/15_14:04:11 debug: crmd_ha_msg_callback: Ignoring HA message (op=noop) from prec370d: not in our membership list (size=1)

The state transition then kept looping, from S_FINALIZE_JOIN -> S_INTEGRATION
to S_INTEGRATION -> S_FINALIZE_JOIN, and so on.

Even worse, the system was rebooted for unexplained reasons...

Message from [EMAIL PROTECTED] at Thu Nov 15 14:06:03 2007 ...
prec370d heartbeat: [2572]: EMERG: Rebooting system. Reason: /usr/lib64/heartbeat/crmd

I think crmd is not the underlying cause in this case...
The problem is hard to reproduce and seems to be a matter of timing.

The logs were very big, so I filed them here:
http://developerbugs.linux-foundation.org//show_bug.cgi?id=1779

Thanks,
Junko

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
