Re: [Pacemaker] Reason for cluster resource migration

Andrew Beekhof Wed, 13 Feb 2013 14:01:34 -0800

On Thu, Feb 14, 2013 at 4:28 AM, Andrew Martin <amar...@xes-inc.com> wrote:
> ----- Original Message -----
>> From: "Andrew Beekhof" <and...@beekhof.net>
>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> Sent: Tuesday, February 12, 2013 10:52:23 PM
>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>
>> On Wed, Feb 13, 2013 at 2:04 AM, Andrew Martin <amar...@xes-inc.com>
>> wrote:
>> > ----- Original Message -----
>> >> From: "Andrew Beekhof" <and...@beekhof.net>
>> >> To: "The Pacemaker cluster resource manager"
>> >> <pacemaker@oss.clusterlabs.org>
>> >> Sent: Monday, February 11, 2013 10:11:53 PM
>> >> Subject: Re: [Pacemaker] Reason for cluster resource migration
>> >>
>> >> On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof
>> >> <and...@beekhof.net>
>> >> wrote:
>> >> > On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof
>> >> > <and...@beekhof.net> wrote:
>> >> >> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin
>> >> >> <amar...@xes-inc.com> wrote:
>> >> >>> Hello,
>> >> >>>
>> >> >>> Unfortunately this same failure occurred again tonight,
>> >> >>
>> >> >> It might be the same effect, but there was no indication that the
>> >> >> PE died last time.
>> >> >>
>> >> >>> taking down a production cluster. Here is the part of the log
>> >> >>> where pengine died:
>> >> >>> Feb 11 17:05:15 storage0 pacemakerd[1572]:   notice:
>> >> >>> pcmk_child_exit: Child process pengine terminated with signal
>> >> >>> 6
>> >> >>> (pid=19357, core=128)
>> >> >>> Feb 11 17:05:16 storage0 pacemakerd[1572]:   notice:
>> >> >>> pcmk_child_exit: Respawning failed child process: pengine
>> >> >>> Feb 11 17:05:16 storage0 pengine[12660]:   notice:
>> >> >>> crm_add_logfile: Additional logging available in
>> >> >>> /var/log/corosync.log
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: crm_ipc_read:
>> >> >>> Connection to pengine failed
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error:
>> >> >>> mainloop_gio_callback: Connection to pengine[0x891680] closed
>> >> >>> (I/O condition=25)
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:     crit:
>> >> >>> pe_ipc_destroy:
>> >> >>> Connection to the Policy Engine failed (pid=-1,
>> >> >>> uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
>> >> >>> save_cib_contents: Saved CIB contents after PE crash to
>> >> >>> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
>> >> >>> Input I_ERROR from save_cib_contents() received in state
>> >> >>> S_POLICY_ENGINE
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning:
>> >> >>> do_state_transition: State transition S_POLICY_ENGINE ->
>> >> >>> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL
>> >> >>> origin=save_cib_contents ]
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_recover:
>> >> >>> Action A_RECOVER (0000000001000000) not supported
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:  warning:
>> >> >>> do_election_vote:
>> >> >>> Not voting in election, we're in state S_RECOVERY
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_log: FSA:
>> >> >>> Input I_TERMINATE from do_recover() received in state
>> >> >>> S_RECOVERY
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:   notice:
>> >> >>> terminate_cs_connection: Disconnecting from Corosync
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]:    error: do_exit: Could
>> >> >>> not recover from internal error
>> >> >>>
>> >> >>> The rest of the log:
>> >> >>> http://sources.xes-inc.com/downloads/pengine.log
>> >> >>> Looking through the full log, it seems that pengine recovers,
>> >> >>
>> >> >> Right, pacemakerd watches for this and restarts it.
>> >> >>
>> >> >>> but perhaps not quickly enough to prevent the STONITH and
>> >> >>> resource migration?
>> >> >>
>> >> >> Highly likely.
>> >> >> However the PE crashing is quite serious.  I'd like to get to the
>> >> >> bottom of that ASAP.
>> >> >>
>> >> >>>
>> >> >>> Here is the pe-core dump file mentioned in the log:
>> >> >>> http://sources.xes-inc.com/downloads/pe-core.bz2
>> >> >>
>> >> >> Unfortunately core files are specific to the machine that
>> >> >> generated them.
>> >> >> If you create a crm_report for about that time, it will open it and
>> >> >> record a backtrace for us to look at.
>> >> >>
>> >> >> Also very important is the contents of:
>> >> >>    
>> >> >> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>> >> >
>> >> > Ohhh, that's what the pe-core link was.
>> >> > I've run it through crm_simulate but couldn't reproduce the
>> >> > crash.
>> >> >
>> >> > So we'll still need the crm_report, it will have more detail on the
>> >> > "Child process pengine terminated with signal 6 (pid=19357, core=128)"
>> >> > part.
>> >>
>> >> Signal 6 is an assertion failure, but strangely there is no mention of
>> >> one in syslog.
>> >> Can you grep /var/log/corosync.log for lines containing 19357 please?
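
A side note on the "terminated with signal 6 (pid=19357, core=128)" log
entry discussed above: on Linux, signal 6 is SIGABRT, which abort() raises
when an assertion fails. The sketch below is illustrative only and is not
Pacemaker code. It assumes the usual Linux wait-status layout (the low 7
bits hold the terminating signal, bit 0x80 marks a core dump), and it
assumes that the core=128 in the log is that flag bit printed as-is. The
function and constant names are hypothetical.

```python
import signal

# Assumption: Linux/glibc wait-status layout.
SIGNAL_MASK = 0x7F      # low 7 bits: terminating signal number
CORE_DUMP_FLAG = 0x80   # set when the kernel wrote a core dump


def describe_wait_status(status: int) -> dict:
    """Split a raw wait status into its signal number and core-dump flag."""
    signum = status & SIGNAL_MASK
    return {
        "signal": signum,
        "is_abort": signum == signal.SIGABRT,
        "core_flag": status & CORE_DUMP_FLAG,
    }


# A child killed by signal 6 that also dumped core.
info = describe_wait_status(6 | CORE_DUMP_FLAG)
print(info)
```

Under those assumptions, a non-zero core flag means a core file was written
somewhere on the node, which is why locating it matters for the backtrace.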
>> >>
>> > Andrew,
>> >
>> > Thanks for the help. Here are the lines containing 19357:
>> > http://sources.xes-inc.com/downloads/19357.log
>> > cl_sysadmin_notify is a clone of an ocf:heartbeat:MailTo resource.
>> > Postfix is installed and running, so I am not sure why these failures
>> > are occurring.
>> >
>> >> > The core file will likely be somewhere under
>> >> > /var/lib/pacemaker/cores
>> > That directory doesn't exist on this server, and it doesn't appear
>> > to be in /var/crash either:
>>
>> It looks like /var/lib/heartbeat/cores/ on your system.
>>
>> > # ls /var/crash/ -ltr
>> > total 67548
>> > -rw-r----- 1 hacluster whoopsie  1293711 Feb  6 10:01
>> > _usr_libexec_pacemaker_pengine.110.crash
>> > ---------- 1 root      whoopsie 67874816 Feb 11 17:07
>> > _usr_libexec_pacemaker_lrmd.0.crash
>> > In case they would be helpful, here are those two files:
>> > http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_pengine.110.crash
>> > http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_lrmd.0.crash
>> >
>> > Here is the crm_report from storage0 from this time period:
>> > http://sources.xes-inc.com/downloads/pengine-report.tar.bz2
>>
>> Are you sure?
>> The pengine crashed on "Feb 11 17:05:15" but the report appears to be
>> from "Tue Feb 12 09:59:50 EST 2013" to "Tue Feb 12 10:30:10 EST 2013"
>>
>> There was one crash in there, but it was of the lrmd.
>> Unfortunately it looks like the binaries and libraries have been
>> stripped.
>>
>> Where did you get them from?  Do you know how to install the -debug
>> packages?
>
> Andrew,
>
> I ran crm_report again as follows:
> # crm_report -f "2013-02-11 17:00:00" -t "2013-02-11 17:30:00" \
> -n "storage0 storage1 storagequorum" -C /tmp/report
> ...
> storage0:   Collecting data from  storage0 storage1 storagequorum (02/11/2013 05:00:00 PM to 02/11/2013 05:30:00 PM)
> ...
> storage1:   Found core file: -rw-r----- 1 root root 18485248 Feb 11 17:10 /var/lib/heartbeat/cores/root/core.7678
>
>
> Here is the report it generated:
> http://sources.xes-inc.com/downloads/storage-report.bz2
>
>
> I created these packages with checkinstall (using the normal Pacemaker
> build process, but substituting checkinstall for "make install"). By
> default it strips debugging information when generating the package,
> which I thought was desirable for a production environment.


Oh it is... right up until the point anything crashes :)
Which should never happen of course, but this is why distros often
ship a "debug" package with the stripped out symbols - they can be
installed afterwards if anything goes wrong.

> I also
> have a debug version of the package, which I will install now. I am
> also working to build Ubuntu packages more officially using
> dpkg-buildpackage. Is there a better way to create these packages?

I don't have a lot of expertise with debian based distros, perhaps
someone else can suggest one...

> I
> would prefer to not have to install build tools and compile the source
> directly on production servers.
>
> Thanks,
>
> Andrew
>
>
>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>

