> The root cause seems to be that heartbeat is not providing client
> status messages (to say that the crmd processes are active) once the
> split-brain heals.
> 
> crmd[1350]: 2007/11/08_10:38:43 info: join_make_offer: Peer process on
> dl380g5c is not active (yet?)
> crmd[1350]: 2007/11/08_10:40:11 WARN: do_state_transition: Only 1 of 2
> cluster nodes are eligible to run resources - continue 0
> 
> Because of this, the crm doesn't consider dl380g5c online and the PE
> can't shut it down.
> 
> 
> I think you need to file a bug for alan about this.

I found a similar case.
While recovering from a split brain,
one node could never rejoin the membership.

crmd[6657]: 2007/11/15_14:04:11 debug: crmd_ha_msg_callback: Ignoring HA
message (op=noop) from prec370d: not in our membership list (size=1)

and the state transitions loop,
from S_FINALIZE_JOIN -> S_INTEGRATION to S_INTEGRATION -> S_FINALIZE_JOIN
and so on.
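
For anyone who wants to check their own logs for the same back-and-forth, here is a minimal sketch in Python. It assumes that crmd logs each change as a line containing "State transition S_X -> S_Y"; that line shape is an assumption and is not copied from the logs in this thread, so adjust the pattern to your output. The function names and the min_repeats threshold are only illustrative.

```python
import re
from collections import Counter

# ASSUMPTION: each transition is logged on one line containing
# "State transition S_X -> S_Y". Adjust the pattern if your logs differ.
TRANSITION_RE = re.compile(r"State transition (S_\w+) -> (S_\w+)")


def count_transitions(lines):
    """Count how often each (from_state, to_state) pair appears."""
    counts = Counter()
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def find_loops(counts, min_repeats=3):
    """Return (state_a, state_b, repeats) for pairs seen in both directions.

    min_repeats is an arbitrary threshold chosen for illustration.
    """
    loops = []
    for (src, dst), forward in counts.items():
        backward = counts.get((dst, src), 0)
        # src < dst reports each pair once, in alphabetical order.
        if src < dst and forward >= min_repeats and backward >= min_repeats:
            loops.append((src, dst, min(forward, backward)))
    return loops
```

To use it, pass the lines of the log file to count_transitions() and give the result to find_loops(); a pair such as S_FINALIZE_JOIN / S_INTEGRATION showing up there would match what I saw.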

Even worse, the system was rebooted for unexplained reasons...

Message from [EMAIL PROTECTED] at Thu Nov 15 14:06:03 2007 ...
prec370d heartbeat: [2572]: EMERG: Rebooting system.  Reason:
/usr/lib64/heartbeat/crmd

I don't think crmd is the underlying cause in this case...
The problem is hard to reproduce and seems to be a matter of timing.

The logs were very big, so I filed them here:
http://developerbugs.linux-foundation.org//show_bug.cgi?id=1779

Thanks,
Junko


