On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin <amar...@xes-inc.com> wrote:
>> Hello,
>>
>> Unfortunately this same failure occurred again tonight,
>
> It might be the same effect, but there was no indication that the PE
> died last time.
>
>> taking down a production cluster. Here is the part of the log where
>> pengine died:
>>
>> Feb 11 17:05:15 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
>> Feb 11 17:05:16 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Respawning failed child process: pengine
>> Feb 11 17:05:16 storage0 pengine[12660]: notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
>> Feb 11 17:05:16 storage0 crmd[19358]: error: crm_ipc_read: Connection to pengine failed
>> Feb 11 17:05:16 storage0 crmd[19358]: error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
>> Feb 11 17:05:16 storage0 crmd[19358]: crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
>> Feb 11 17:05:16 storage0 crmd[19358]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
>> Feb 11 17:05:16 storage0 crmd[19358]: notice: terminate_cs_connection: Disconnecting from Corosync
>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_exit: Could not recover from internal error
>>
>> The rest of the log:
>> http://sources.xes-inc.com/downloads/pengine.log
>>
>> Looking through the full log, it seems that pengine recovers,
>
> Right, pacemakerd watches for this and restarts it.
>
>> but perhaps not quickly enough to prevent the STONITH and resource
>> migration?
>
> Highly likely.
> However the PE crashing is quite serious. I'd like to get to the
> bottom of that ASAP.
>
>>
>> Here is the pe-core dump file mentioned in the log:
>> http://sources.xes-inc.com/downloads/pe-core.bz2
>
> Unfortunately core files are specific to the machine that generated them.
> If you create a crm_report for about that time, it will open it and
> record a backtrace for us to look at.
>
> Also very important is the contents of:
> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
Ohhh, that's what the pe-core link was. I've run it through crm_simulate
but couldn't reproduce the crash. So we'll still need the crm_report; it
will have more detail on the "Child process pengine terminated with
signal 6 (pid=19357, core=128)" part. The core file will likely be
somewhere under /var/lib/pacemaker/cores, but crm_report should be able
to find it.

>
>>
>> Thanks,
>>
>> Andrew
>>
>> ----- Original Message -----
>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>> Sent: Friday, February 1, 2013 4:32:26 PM
>>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>>
>>> ----- Original Message -----
>>> > From: "Andrew Beekhof" <and...@beekhof.net>
>>> > To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>> > Sent: Thursday, December 6, 2012 8:36:27 PM
>>> > Subject: Re: [Pacemaker] Reason for cluster resource migration
>>> >
>>> > On Wed, Dec 5, 2012 at 8:29 AM, Andrew Martin <amar...@xes-inc.com> wrote:
>>> > > Hello,
>>> > >
>>> > > I am running a 3-node Pacemaker cluster (2 "real" nodes and 1 quorum
>>> > > node in standby) on Ubuntu 12.04 server (amd64) with Pacemaker 1.1.8
>>> > > and Corosync 2.1.0. My cluster configuration is:
>>> > > http://pastebin.com/6TPkWtbt
>>> > >
>>> > > Recently, pengine died on storage0 (where the resources were running)
>>> > > which also happened to be the DC at the time. Consequently, Pacemaker
>>> > > went into recovery mode and released its role as DC, at which point
>>> > > storage1 took over the DC role and migrated the resources away from
>>> > > storage0 and onto storage1.
>>> > > Looking through the logs, it seems like storage0 came back into the
>>> > > cluster before the migration of the resources began:
>>> > > Dec 03 08:31:20 [3165] storage1 crmd: info: peer_update_callback: Client storage0/peer now has status [online] (DC=true)
>>> > > ...
>>> > > Dec 03 08:31:20 [3164] storage1 pengine: notice: LogActions: Start rscXXX (storage1)
>>> > >
>>> > > Thus, why did the migration occur, rather than aborting and having
>>> > > the resources simply remain running on storage0? Here are the logs
>>> > > from each of the nodes:
>>> > > storage0: http://pastebin.com/ZqqnH9uf
>>> > > storage1: http://pastebin.com/rvSLVcZs
>>> >
>>> > Hmm, thats an interesting one.
>>> > Can you provide this file? It will hold the answer:
>>> >
>>> > Dec 03 08:31:31 [3164] storage1 pengine: notice: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-28.bz2
>>> >
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Andrew
>>>
>>> Andrew,
>>>
>>> Sorry for the delayed response.
>>> Here is the file you requested:
>>> http://sources.xes-inc.com/downloads/pe-input-28.bz2
>>>
>>> This same condition just occurred again on storage1 today (pengine
>>> died, and then storage1 was STONITHed).
>>>
>>> Thanks,
>>>
>>> Andrew

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
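P.S. For anyone following along, the diagnostic steps discussed above
(gather a crm_report covering the crash window, then replay the saved
PE input) can be sketched roughly as below. This is a hypothetical
sketch, not a tested recipe: the time window and the report destination
are assumptions, and the pe-input path is taken from the thread; adjust
all of them for your own cluster.

```shell
#!/bin/sh
# Sketch only: times, destination path, and file locations are
# assumptions -- adjust for the actual crash (Feb 11 17:05 on storage0).
FROM="2013-02-11 16:55:00"
TO="2013-02-11 17:15:00"

# 1. Collect a crm_report for the crash window. It gathers logs, PE
#    files, and any core files (typically under /var/lib/pacemaker/cores)
#    and records a backtrace for the developers to inspect.
if command -v crm_report >/dev/null 2>&1; then
    crm_report --from "$FROM" --to "$TO" /tmp/pe-crash-report
else
    echo "crm_report not available; run this on a cluster node"
fi

# 2. Replay a saved PE input file to see what the policy engine decided
#    (crm_simulate reads the bz2-compressed files directly).
if command -v crm_simulate >/dev/null 2>&1; then
    crm_simulate --xml-file /var/lib/pacemaker/pengine/pe-input-28.bz2 --simulate
else
    echo "crm_simulate not available; install the Pacemaker CLI tools"
fi
```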