On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin <amar...@xes-inc.com> wrote:
>>> Hello,
>>>
>>> Unfortunately this same failure occurred again tonight,
>>
>> It might be the same effect, but there was no indication that the PE
>> died last time.
>>
>>> taking down a production cluster. Here is the part of the log where
>>> pengine died:
>>> Feb 11 17:05:15 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
>>> Feb 11 17:05:16 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Respawning failed child process: pengine
>>> Feb 11 17:05:16 storage0 pengine[12660]: notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
>>> Feb 11 17:05:16 storage0 crmd[19358]: error: crm_ipc_read: Connection to pengine failed
>>> Feb 11 17:05:16 storage0 crmd[19358]: error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
>>> Feb 11 17:05:16 storage0 crmd[19358]: crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
>>> Feb 11 17:05:16 storage0 crmd[19358]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
>>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
>>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
>>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
>>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
>>> Feb 11 17:05:16 storage0 crmd[19358]: notice: terminate_cs_connection: Disconnecting from Corosync
>>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_exit: Could not recover from internal error
>>>
>>> The rest of the log: http://sources.xes-inc.com/downloads/pengine.log
>>> Looking through the full log, it seems that pengine recovers,
>>
>> Right, pacemakerd watches for this and restarts it.
>>
>>> but perhaps not quickly enough to prevent the STONITH and resource
>>> migration?
>>
>> Highly likely.
>> However the PE crashing is quite serious. I'd like to get to the
>> bottom of that ASAP.
>>
>>> Here is the pe-core dump file mentioned in the log:
>>> http://sources.xes-inc.com/downloads/pe-core.bz2
>>
>> Unfortunately core files are specific to the machine that generated them.
>> If you create a crm_report for about that time, it will open it and
>> record a backtrace for us to look at.
>>
>> Also very important is the contents of:
>> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>
> Ohhh, thats what the pe-core link was.
> I've run it through crm_simulate but couldn't reproduce the crash.
>
> So we'll still need the crm_report, it will have more detail on the
> "Child process pengine terminated with signal 6 (pid=19357, core=128)"
> part.
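For the crm_report, something along these lines should capture the window
around the crash (the times, node list, and output name here are only
placeholders, so adjust them to match your cluster):

    # roughly an hour either side of the 17:05 pengine crash on Feb 11
    crm_report -f "2013-02-11 16:00:00" -t "2013-02-11 18:00:00" -n "storage0 storage1" pengine-crash

That should gather the logs, PE files, and any cores it can find into a
tarball named after that last argument.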
Signal 6 (SIGABRT) normally indicates an assertion failure, but strangely
there is no mention of one in syslog.
Can you grep /var/log/corosync.log for lines containing 19357 please?

> The core file will likely be somewhere under /var/lib/pacemaker/cores
> but crm_report should be able to find it.
>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ----- Original Message -----
>>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>>> Sent: Friday, February 1, 2013 4:32:26 PM
>>>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>>>
>>>> ----- Original Message -----
>>>> > From: "Andrew Beekhof" <and...@beekhof.net>
>>>> > To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>>> > Sent: Thursday, December 6, 2012 8:36:27 PM
>>>> > Subject: Re: [Pacemaker] Reason for cluster resource migration
>>>> >
>>>> > On Wed, Dec 5, 2012 at 8:29 AM, Andrew Martin <amar...@xes-inc.com> wrote:
>>>> > > Hello,
>>>> > >
>>>> > > I am running a 3-node Pacemaker cluster (2 "real" nodes and 1 quorum
>>>> > > node in standby) on Ubuntu 12.04 server (amd64) with Pacemaker 1.1.8
>>>> > > and Corosync 2.1.0. My cluster configuration is:
>>>> > > http://pastebin.com/6TPkWtbt
>>>> > >
>>>> > > Recently, pengine died on storage0 (where the resources were running)
>>>> > > which also happened to be the DC at the time. Consequently, Pacemaker
>>>> > > went into recovery mode and released its role as DC, at which point
>>>> > > storage1 took over the DC role and migrated the resources away from
>>>> > > storage0 and onto storage1. Looking through the logs, it seems like
>>>> > > storage0 came back into the cluster before the migration of the
>>>> > > resources began:
>>>> > > Dec 03 08:31:20 [3165] storage1 crmd: info: peer_update_callback: Client storage0/peer now has status [online] (DC=true)
>>>> > > ...
>>>> > > Dec 03 08:31:20 [3164] storage1 pengine: notice: LogActions: Start rscXXX (storage1)
>>>> > >
>>>> > > Thus, why did the migration occur, rather than aborting and having the
>>>> > > resources simply remain running on storage0? Here are the logs from
>>>> > > each of the nodes:
>>>> > > storage0: http://pastebin.com/ZqqnH9uf
>>>> > > storage1: http://pastebin.com/rvSLVcZs
>>>> >
>>>> > Hmm, thats an interesting one.
>>>> > Can you provide this file?
>>>> > It will hold the answer:
>>>> >
>>>> > Dec 03 08:31:31 [3164] storage1 pengine: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-28.bz2
>>>> >
>>>> > > Thanks,
>>>> > >
>>>> > > Andrew
>>>>
>>>> Andrew,
>>>>
>>>> Sorry for the delayed response. Here is the file you requested:
>>>> http://sources.xes-inc.com/downloads/pe-input-28.bz2
>>>>
>>>> This same condition just occurred again on storage1 today (pengine
>>>> died, and then storage1 was STONITHed).
>>>>
>>>> Thanks,
>>>>
>>>> Andrew

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org