>>> Andrew Beekhof <[email protected]> wrote on 30.03.2012 at 00:57 in
>>> message
<caedlwg3dqfffq+bgbdqc-fexjo9rhuujtts8_wryn_-kzed...@mail.gmail.com>:
> On Thu, Mar 29, 2012 at 8:31 PM, Ulrich Windl
> <[email protected]> wrote:
> > Hi!
> >
> > We had a problem when crmd crashed. Obviously, crmd after being restarted
> > tried to recover, but it seems recovery is not implemented yet:
>
> Recovery is implemented, just not graceful recovery without a restart
> of the process. Which is what you're seeing in the logs.
I'm specifically referring to "crmd: [21618]: ERROR: do_recover: Action
A_RECOVER (0000000001000000) not supported". And crmd obviously was restarted,
because the previous PID was 17500.
Regards,
Ulrich
> The underlying cause however is that the lrmd, specifically the call
> we make below, isn't behaving as expected.
>
> call_id = rsc->ops->perform_op(rsc, op);
>
> if (call_id <= 0) {
>     crm_err("Operation %s on %s failed: %d", operation, rsc->id, call_id);
>     register_fsa_error(C_FSA_INTERNAL, I_FAIL, NULL);
>     ...
>
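
To make the chain of events in the log below easier to follow, here is a
minimal C sketch of the path the restarted crmd appears to take. It is only a
model built from the log output: the enum and function names (fsa_next,
simulate_start) are made up for illustration and are not the Pacemaker source.
The assumption is that a call_id <= 0 raises I_FAIL, and that do_recover
answers with I_TERMINATE because in-place recovery is not supported.

```c
/* Hypothetical, simplified model of the transitions seen in the logs.
 * State and input names mirror the log output; nothing here is taken
 * from the actual crmd implementation. */
enum fsa_state { S_NOT_DC, S_RECOVERY, S_TERMINATE };
enum fsa_input { I_FAIL, I_TERMINATE };

/* Return the next state for the given input. */
enum fsa_state fsa_next(enum fsa_state cur, enum fsa_input in)
{
    switch (in) {
    case I_FAIL:        /* assumed: raised when call_id <= 0 */
        return S_RECOVERY;
    case I_TERMINATE:   /* assumed: raised by do_recover() */
        return S_TERMINATE;
    }
    return cur;         /* unknown input: stay in the current state */
}

/* Simulate one crmd start and return the state it ends up in.
 * call_id stands for the return value of perform_op(). */
enum fsa_state simulate_start(int call_id)
{
    enum fsa_state s = S_NOT_DC;

    if (call_id <= 0) {
        s = fsa_next(s, I_FAIL);       /* S_NOT_DC   -> S_RECOVERY  */
        s = fsa_next(s, I_TERMINATE);  /* S_RECOVERY -> S_TERMINATE */
    }
    return s;
}
```

With these assumptions, simulate_start(0) ends in S_TERMINATE while
simulate_start(1) stays in S_NOT_DC, which would match a respawned crmd
exiting again for as long as perform_op keeps failing.
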
> > kernel: [ 2523.296059] crmd[17500]: segfault at 14 ip 0000000000418110 sp 00007fffe415d370 error 4 in crmd[400000+3a000]
> > cib: [17496]: WARN: send_via_callback_channel: Delivery of reply to client 17500/8b364949-0abd-40cd-a0cd-8ff9ea184d02 failed
> > corosync[17457]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process crmd terminated with signal 11 (pid=17500, core=true)
> > crmd: [21618]: info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
> > crmd: [21618]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> > crmd: [21618]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> > crmd: [21618]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
> > crmd: [21618]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> >
> > Unfortunately that strategy was not successful.
> > corosync[17457]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: crmd
> > crmd: [28719]: info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
> > crmd: [28719]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
> > crmd: [28719]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> >
> > This cycle repeated for more than two hours until the other node of the
> > two-node cluster rebooted.
> >
> > pengine: [17043]: WARN: pe_fence_node: Node h07 will be fenced because it is un-expectedly down
> >
> > The software build used is basically SLES11 SP1 with a newer corosync
> > (corosync-1.4.1-0.3.3.3518.1.PTF.712037). Have there been any improvements
> > since then on the main development line of corosync/pacemaker?
> >
> > Regards,
> > Ulrich
> >
> >
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems