On Thu, Mar 29, 2012 at 8:31 PM, Ulrich Windl
<[email protected]> wrote:
> Hi!
>
> We had a problem when crmd crashed. Obviously, crmd after being restarted
> tried to recover, but it seems recovery is not implemented yet:
Recovery is implemented; what is missing is graceful recovery without
restarting the process. The exit-and-respawn cycle is what you're seeing
in the logs.
The underlying cause, however, is that the lrmd, specifically the call
we make below, isn't behaving as expected.
    call_id = rsc->ops->perform_op(rsc, op);
    if (call_id <= 0) {
        crm_err("Operation %s on %s failed: %d", operation, rsc->id, call_id);
        register_fsa_error(C_FSA_INTERNAL, I_FAIL, NULL);
        ...
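
To make the failure path concrete, here is a rough, self-contained sketch of how a non-positive call id could lead to the transitions in your logs (S_NOT_DC -> S_RECOVERY -> S_TERMINATE). It is not the crmd source: the state and input names are taken from the log lines, but perform_op_fn, failing_perform_op, working_perform_op, submit_op, next_state, run_once and the "rsc1"/"monitor" arguments are hypothetical stand-ins, and the transition table covers only the two transitions shown in the logs.

```c
#include <stdio.h>

/* States and inputs named in the logs; the enums themselves are stand-ins. */
enum fsa_state { S_NOT_DC, S_RECOVERY, S_TERMINATE };
enum fsa_input { I_NONE, I_FAIL, I_TERMINATE };

/* Hypothetical stand-in for the lrmd call: a positive call id means the
 * operation was accepted, a value <= 0 means it was not. */
typedef int (*perform_op_fn)(const char *rsc_id, const char *operation);

int failing_perform_op(const char *rsc_id, const char *operation)
{
    (void)rsc_id;
    (void)operation;
    return -1; /* simulate an lrmd that rejects the operation */
}

int working_perform_op(const char *rsc_id, const char *operation)
{
    (void)rsc_id;
    (void)operation;
    return 42; /* arbitrary positive call id */
}

/* Same check as in the snippet above: call_id <= 0 raises I_FAIL. */
enum fsa_input submit_op(perform_op_fn perform_op,
                         const char *rsc_id, const char *operation)
{
    int call_id = perform_op(rsc_id, operation);

    if (call_id <= 0) {
        fprintf(stderr, "Operation %s on %s failed: %d\n",
                operation, rsc_id, call_id);
        return I_FAIL;
    }
    return I_NONE;
}

/* Only the two transitions visible in the logs are modelled here. */
enum fsa_state next_state(enum fsa_state cur, enum fsa_input in)
{
    if (cur == S_NOT_DC && in == I_FAIL) {
        return S_RECOVERY;
    }
    if (cur == S_RECOVERY && in == I_TERMINATE) {
        return S_TERMINATE;
    }
    return cur;
}

/* One process lifetime: submit an operation and follow the state machine. */
enum fsa_state run_once(perform_op_fn perform_op)
{
    enum fsa_state state = S_NOT_DC;

    state = next_state(state, submit_op(perform_op, "rsc1", "monitor"));
    if (state == S_RECOVERY) {
        /* In-process recovery is not available, so the daemon asks to
         * terminate and relies on its parent to respawn it. */
        state = next_state(state, I_TERMINATE);
    }
    return state;
}
```

With the failing stub, run_once() ends in S_TERMINATE on every start, which is the loop you observed: as long as the lrmd call keeps returning a non-positive id, each respawned crmd takes the same path. With the working stub it stays in S_NOT_DC.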
> kernel: [ 2523.296059] crmd[17500]: segfault at 14 ip 0000000000418110 sp
> 00007fffe415d370 error 4 in crmd[400000+3a000]
> cib: [17496]: WARN: send_via_callback_channel: Delivery of reply to client
> 17500/8b364949-0abd-40cd-a0cd-8ff9ea184d02 failed
> corosync[17457]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process crmd
> terminated with signal 11 (pid=17500, core=true)
> crmd: [21618]: info: do_state_transition: State transition S_NOT_DC ->
> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
> crmd: [21618]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not
> supported
> crmd: [21618]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover()
> received in state S_RECOVERY
> crmd: [21618]: info: do_state_transition: State transition S_RECOVERY ->
> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
> crmd: [21618]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the
> CRMd
>
> Unfortunately that strategy was not successful.
> corosync[17457]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child
> process: crmd
> crmd: [28719]: info: do_state_transition: State transition S_NOT_DC ->
> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
> crmd: [28719]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not
> supported
> crmd: [28719]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover()
> received in state S_RECOVERY
>
> The game repeated for more than two hours until the other node of the
> two-node cluster rebooted.
>
> pengine: [17043]: WARN: pe_fence_node: Node h07 will be fenced because it is
> un-expectedly down
>
> The software build used is basically SLES11 SP1 with a newer corosync
> (corosync-1.4.1-0.3.3.3518.1.PTF.712037). Have there been any improvements
> since then on the main development line of corosync/pacemaker?
>
> Regards,
> Ulrich
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems