On Mon, Apr 2, 2012 at 5:06 PM, Ulrich Windl
<[email protected]> wrote:
>>>> Andrew Beekhof <[email protected]> wrote on 30.03.2012 at 00:57 in message
> <caedlwg3dqfffq+bgbdqc-fexjo9rhuujtts8_wryn_-kzed...@mail.gmail.com>:
>> On Thu, Mar 29, 2012 at 8:31 PM, Ulrich Windl
>> <[email protected]> wrote:
>> > Hi!
>> >
>> > We had a problem when crmd crashed. Obviously, crmd after being restarted
>> > tried to recover, but it seems recovery is not implemented yet:
>>
>> Recovery is implemented, just not graceful recovery without a restart
>> of the process, which is what you're seeing in the logs.
>
> I'm specifically referring to "crmd: [21618]: ERROR: do_recover: Action
> A_RECOVER (0000000001000000) not supported". And crmd obviously was
> restarted, because the previous PID was 17500.
Yes, because the underlying problem still exists: it's in the lrmd.
>
> Regards,
> Ulrich
>
>
>> The underlying cause however is that the lrmd, specifically the call
>> we make below, isn't behaving as expected.
>>
>>     call_id = rsc->ops->perform_op(rsc, op);
>>
>>     if (call_id <= 0) {
>>         crm_err("Operation %s on %s failed: %d", operation, rsc->id, call_id);
>>         register_fsa_error(C_FSA_INTERNAL, I_FAIL, NULL);
>>         ...
>>
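For readers without the crmd source at hand, the check quoted above can be tried in isolation. What follows is only a minimal sketch, not the Pacemaker code: demo_rsc_t, demo_op_t, failing_perform_op(), working_perform_op(), submit_op() and the fsa_error_registered flag are hypothetical stand-ins. Only the "call_id <= 0" test and the idea of escalating it as an internal failure are taken from the excerpt.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the lrm operation and resource types. */
typedef struct demo_op { const char *op_type; } demo_op_t;
typedef struct demo_rsc demo_rsc_t;

struct demo_rsc_ops {
    /* Assumed contract: positive call id on success, <= 0 on failure. */
    int (*perform_op)(demo_rsc_t *rsc, demo_op_t *op);
};

struct demo_rsc {
    const char *id;
    const struct demo_rsc_ops *ops;
};

/* Stands in for register_fsa_error(): records that a failure was escalated. */
int fsa_error_registered = 0;

/* Simulates an lrmd call that does not behave as expected. */
int failing_perform_op(demo_rsc_t *rsc, demo_op_t *op)
{
    (void)rsc;
    (void)op;
    return 0;
}

/* Simulates a healthy lrmd call that hands back a call id. */
int working_perform_op(demo_rsc_t *rsc, demo_op_t *op)
{
    (void)rsc;
    (void)op;
    return 7;
}

/* Mirrors the quoted check: 0 on success, -1 after escalating the failure. */
int submit_op(demo_rsc_t *rsc, demo_op_t *op)
{
    int call_id = rsc->ops->perform_op(rsc, op);

    if (call_id <= 0) {
        fprintf(stderr, "Operation %s on %s failed: %d\n",
                op->op_type, rsc->id, call_id);
        fsa_error_registered = 1;
        return -1;
    }
    return 0;
}
```

Calling submit_op() on a resource whose ops point at failing_perform_op() takes the error branch and sets the flag; with working_perform_op() it returns 0 and leaves the flag untouched.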
>> > kernel: [ 2523.296059] crmd[17500]: segfault at 14 ip 0000000000418110 sp 00007fffe415d370 error 4 in crmd[400000+3a000]
>> > cib: [17496]: WARN: send_via_callback_channel: Delivery of reply to client 17500/8b364949-0abd-40cd-a0cd-8ff9ea184d02 failed
>> > corosync[17457]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process crmd terminated with signal 11 (pid=17500, core=true)
>> > crmd: [21618]: info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
>> > crmd: [21618]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
>> > crmd: [21618]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
>> > crmd: [21618]: info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
>> > crmd: [21618]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
>> >
>> > Unfortunately that strategy was not successful.
>> > corosync[17457]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: crmd
>> > crmd: [28719]: info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
>> > crmd: [28719]: ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
>> > crmd: [28719]: ERROR: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
>> >
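The cycle visible in these log lines can be modelled in a few lines of C. This is a rough sketch under stated assumptions: the state and input names are copied from the logs, but next_state(), run_once(), count_failed_respawns() and the lrmd_ok flag are hypothetical, and the transition logic is a simplification, not crmd's real FSA table.

```c
/* State and input names as they appear in the logs. */
enum fsa_state { S_NOT_DC, S_RECOVERY, S_TERMINATE };
enum fsa_input { I_FAIL, I_TERMINATE };

/* Simplified transition function covering only the two logged transitions. */
enum fsa_state next_state(enum fsa_state cur, enum fsa_input in)
{
    if (in == I_TERMINATE) {
        return S_TERMINATE;
    }
    if (in == I_FAIL) {
        return S_RECOVERY;
    }
    return cur;
}

/*
 * One lifetime of a respawned daemon.
 * lrmd_ok is a hypothetical flag: non-zero means the lrmd call succeeds.
 * Returns 1 if the process stays up, 0 if it ends in S_TERMINATE.
 */
int run_once(int lrmd_ok)
{
    enum fsa_state s = S_NOT_DC;

    if (lrmd_ok) {
        return 1;
    }
    /* The failed operation raises I_FAIL (origin=do_lrm_rsc_op in the logs). */
    s = next_state(s, I_FAIL);
    /* A_RECOVER is not supported, so recovery raises I_TERMINATE. */
    s = next_state(s, I_TERMINATE);
    return s == S_TERMINATE ? 0 : 1;
}

/* Counts how many respawn attempts end in an exit. */
int count_failed_respawns(int attempts, int lrmd_ok)
{
    int failed = 0;

    for (int i = 0; i < attempts; i++) {
        if (!run_once(lrmd_ok)) {
            failed++;
        }
    }
    return failed;
}
```

As long as lrmd_ok stays 0, every respawn ends in S_TERMINATE, which matches the repeated exits reported here. Respawning alone cannot break the cycle because the fault sits outside the restarted process.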
>> > The game repeated for more than two hours until the other node of the
>> > two-node cluster rebooted.
>> >
>> > pengine: [17043]: WARN: pe_fence_node: Node h07 will be fenced because it is un-expectedly down
>> >
>> > The software being used is basically SLES11 SP1 with a newer corosync
>> > (corosync-1.4.1-0.3.3.3518.1.PTF.712037). Have there been any improvements
>> > since then on the main development line of corosync/pacemaker?
>> >
>> > Regards,
>> > Ulrich
>> >
>> >
>> > _______________________________________________
>> > Linux-HA mailing list
>> > [email protected]
>> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> > See also: http://linux-ha.org/ReportingProblems