Re: [Pacemaker] Reason for cluster resource migration

Andrew Beekhof Tue, 12 Feb 2013 21:00:24 -0800

On Wed, Feb 13, 2013 at 3:52 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Wed, Feb 13, 2013 at 2:04 AM, Andrew Martin <amar...@xes-inc.com> wrote:
>> ----- Original Message -----
>>> From: "Andrew Beekhof" <and...@beekhof.net>
>>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>> Sent: Monday, February 11, 2013 10:11:53 PM
>>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>>
>>> On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof <and...@beekhof.net>
>>> wrote:
>>> > On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof
>>> > <and...@beekhof.net> wrote:
>>> >> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin
>>> >> <amar...@xes-inc.com> wrote:
>>> >>> Hello,
>>> >>>
>>> >>> Unfortunately this same failure occurred again tonight,
>>> >>
>>> >> It might be the same effect, but there was no indication that
>>> >> the PE died last time.
>>> >>
>>> >>> taking down a production cluster. Here is the part of the log
>>> >>> where pengine died:
>>> >>> Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice:
>>> >>> pcmk_child_exit: Child process pengine terminated with signal 6
>>> >>> (pid=19357, core=128)
>>> >>> Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice:
>>> >>> pcmk_child_exit: Respawning failed child process: pengine
>>> >>> Feb 11 17:05:16 storage0 pengine[12660]:   notice:
>>> >>> crm_add_logfile: Additional logging available in
>>> >>> /var/log/corosync.log
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read:
>>> >>> Connection to pengine failed
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error:
>>> >>> mainloop_gio_callback: Connection to pengine[0x891680] closed
>>> >>> (I/O condition=25)
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:     crit: pe_ipc_destroy:
>>> >>> Connection to the Policy Engine failed (pid=-1,
>>> >>> uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
>>> >>> save_cib_contents: Saved CIB contents after PE crash to
>>> >>> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
>>> >>> Input I_ERROR from save_cib_contents() received in state
>>> >>> S_POLICY_ENGINE
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning:
>>> >>> do_state_transition: State transition S_POLICY_ENGINE ->
>>> >>> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL
>>> >>> origin=save_cib_contents ]
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_recover:
>>> >>> Action A_RECOVER (0000000001000000) not supported
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning: do_election_vote:
>>> >>> Not voting in election, we're in state S_RECOVERY
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
>>> >>> Input I_TERMINATE from do_recover() received in state S_RECOVERY
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
>>> >>> terminate_cs_connection: Disconnecting from Corosync
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_exit: Could
>>> >>> not recover from internal error
>>> >>>
>>> >>> The rest of the log:
>>> >>> http://sources.xes-inc.com/downloads/pengine.log
>>> >>> Looking through the full log, it seems that pengine recovers,
>>> >>
>>> >> Right, pacemakerd watches for this and restarts it.
>>> >>
>>> >>> but perhaps not quickly enough to prevent the STONITH and
>>> >>> resource migration?
>>> >>
>>> >> Highly likely.
>>> >> However the PE crashing is quite serious.  I'd like to get to the
>>> >> bottom of that ASAP.
>>> >>
>>> >>>
>>> >>> Here is the pe-core dump file mentioned in the log:
>>> >>> http://sources.xes-inc.com/downloads/pe-core.bz2
>>> >>
>>> >> Unfortunately core files are specific to the machine that
>>> >> generated them.
>>> >> If you create a crm_report covering that time, it will open the
>>> >> core file and record a backtrace for us to look at.
>>> >>
>>> >> Also very important is the contents of:
>>> >>    /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>>> >
>>> > Ohhh, that's what the pe-core link was.
>>> > I've run it through crm_simulate but couldn't reproduce the crash.
>>> >
>>> > So we'll still need the crm_report, it will have more detail on the
>>> > "Child process pengine terminated with signal 6 (pid=19357,
>>> > core=128)"
>>> > part.
>>>
>>> Signal 6 is SIGABRT, which usually means an assertion failure, but
>>> strangely there is no mention of one in syslog.
>>> Can you grep /var/log/corosync.log for lines containing 19357 please?
>>>
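A minimal sketch of that search, assuming a POSIX shell and a grep that supports -w (GNU grep does). The helper name lines_for_pid is made up for illustration; only the PID and the log path come from this thread.

```shell
# Hypothetical helper (not part of Pacemaker): print every line of a log file
# that mentions the given PID as a whole word, so 19357 does not also match
# a longer number such as 193570.
lines_for_pid() {
    pid="$1"
    logfile="$2"
    grep -w -- "$pid" "$logfile"
}

# Intended use on the affected node:
#   lines_for_pid 19357 /var/log/corosync.log
```

Matching on the whole word keeps unrelated processes with longer PIDs out of the result, which matters when the output is meant to show what one specific pengine process logged before it aborted.
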
>> Andrew,
>>
>> Thanks for the help. Here are the lines containing 19357:
>> http://sources.xes-inc.com/downloads/19357.log
>> cl_sysadmin_notify is a clone of an ocf:heartbeat:MailTo resource. Postfix
>> is installed and running, so I am not sure why these failures are occurring.
>>
>>> > The core file will likely be somewhere under
>>> > /var/lib/pacemaker/cores
>> That directory doesn't exist on this server, and it doesn't appear to be in 
>> /var/crash either:
>
> It looks like /var/lib/heartbeat/cores/ on your system.
>
>> # ls /var/crash/ -ltr
>> total 67548
>> -rw-r----- 1 hacluster whoopsie  1293711 Feb  6 10:01 _usr_libexec_pacemaker_pengine.110.crash
>> ---------- 1 root      whoopsie 67874816 Feb 11 17:07 _usr_libexec_pacemaker_lrmd.0.crash
>> In case they would be helpful, here are those two files:
>> http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_pengine.110.crash
>> http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_lrmd.0.crash
>>
>> Here is the crm_report from storage0 from this time period:
>> http://sources.xes-inc.com/downloads/pengine-report.tar.bz2
>
> Are you sure?
> The pengine crashed on "Feb 11 17:05:15" but the report appears to be
> from "Tue Feb 12 09:59:50 EST 2013" to "Tue Feb 12 10:30:10 EST 2013"
>
> There was one crash in there, but it was of the lrmd.
> Unfortunately it looks like the binaries and libraries have been stripped.
>
> Where did you get them from?  Do you know how to install the -debug packages?


This link has some useful info:

https://wiki.ubuntu.com/DebuggingProgramCrash#Debug_Symbol_Packages
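
On the report window: the crash was logged at "Feb 11 17:05:15", so the report needs to cover that time. A rough sketch of how the command could be put together, assuming GNU date and crm_report's -f (from) and -t (to) options. The helper name report_cmd, the 30 minute margin and the "pengine-crash" label are placeholders of mine, not anything Pacemaker defines. It only prints the command so it can be checked before running it on the node.

```shell
# Hypothetical helper: print a crm_report command line that covers
# <margin> minutes on either side of a timestamp.
# Assumes GNU date (needs -d and the @epoch syntax).
report_cmd() {
    ts="$1"          # crash time, e.g. "2013-02-11 17:05:15"
    margin_min="$2"  # minutes to include before and after the crash
    dest="$3"        # report name, an arbitrary label
    epoch=$(date -d "$ts" +%s) || return 1
    from=$(date -d "@$((epoch - margin_min * 60))" '+%Y-%m-%d %H:%M:%S')
    to=$(date -d "@$((epoch + margin_min * 60))" '+%Y-%m-%d %H:%M:%S')
    printf 'crm_report -f "%s" -t "%s" %s\n' "$from" "$to" "$dest"
}

# Prints the command for the window around the pengine crash.
report_cmd "2013-02-11 17:05:15" 30 pengine-crash
```

The timestamps are interpreted in the local timezone of the machine where this runs, so it is best run on the node that produced the logs.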

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
