On Wed, Feb 13, 2013 at 3:52 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Wed, Feb 13, 2013 at 2:04 AM, Andrew Martin <amar...@xes-inc.com> wrote:
>> ----- Original Message -----
>>> From: "Andrew Beekhof" <and...@beekhof.net>
>>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>> Sent: Monday, February 11, 2013 10:11:53 PM
>>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>>
>>> On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>> > On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>> >> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin <amar...@xes-inc.com> wrote:
>>> >>> Hello,
>>> >>>
>>> >>> Unfortunately this same failure occurred again tonight,
>>> >>
>>> >> It might be the same effect, but there was no indication that the PE
>>> >> died last time.
>>> >>
>>> >>> taking down a production cluster. Here is the part of the log
>>> >>> where pengine died:
>>> >>> Feb 11 17:05:15 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
>>> >>> Feb 11 17:05:16 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Respawning failed child process: pengine
>>> >>> Feb 11 17:05:16 storage0 pengine[12660]: notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: crm_ipc_read: Connection to pengine failed
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: notice: terminate_cs_connection: Disconnecting from Corosync
>>> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_exit: Could not recover from internal error
>>> >>>
>>> >>> The rest of the log:
>>> >>> http://sources.xes-inc.com/downloads/pengine.log
>>> >>> Looking through the full log, it seems that pengine recovers,
>>> >>
>>> >> Right, pacemakerd watches for this and restarts it.
>>> >>
>>> >>> but perhaps not quickly enough to prevent the STONITH and
>>> >>> resource migration?
>>> >>
>>> >> Highly likely.
>>> >> However the PE crashing is quite serious. I'd like to get to the
>>> >> bottom of that ASAP.
>>> >>
>>> >>>
>>> >>> Here is the pe-core dump file mentioned in the log:
>>> >>> http://sources.xes-inc.com/downloads/pe-core.bz2
>>> >>
>>> >> Unfortunately core files are specific to the machine that
>>> >> generated them.
>>> >> If you create a crm_report for about that time, it will open it and
>>> >> record a backtrace for us to look at.
>>> >>
>>> >> Also very important is the contents of:
>>> >>
>>> >> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>>> >
>>> > Ohhh, that's what the pe-core link was.
>>> > I've run it through crm_simulate but couldn't reproduce the crash.
>>> >
>>> > So we'll still need the crm_report, it will have more detail on the
>>> > "Child process pengine terminated with signal 6 (pid=19357, core=128)"
>>> > part.
>>>
>>> Signal 6 is an assertion failure, but strangely there is no mention of
>>> one in syslog.
>>> Can you grep /var/log/corosync.log for lines containing 19357 please?
>>>
>> Andrew,
>>
>> Thanks for the help. Here are the lines containing 19357:
>> http://sources.xes-inc.com/downloads/19357.log
>> cl_sysadmin_notify is a clone of an ocf:heartbeat:MailTo resource. Postfix
>> is installed and running, so I am not sure why these failures are occurring.
>>
>>> > The core file will likely be somewhere under
>>> > /var/lib/pacemaker/cores
>> That directory doesn't exist on this server, and it doesn't appear to be in
>> /var/crash either:
>
> It looks like /var/lib/heartbeat/cores/ on your system.
>
>> # ls /var/crash/ -ltr
>> total 67548
>> -rw-r----- 1 hacluster whoopsie 1293711 Feb 6 10:01 _usr_libexec_pacemaker_pengine.110.crash
>> ---------- 1 root whoopsie 67874816 Feb 11 17:07 _usr_libexec_pacemaker_lrmd.0.crash
>> In case they would be helpful, here are those two files:
>> http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_pengine.110.crash
>> http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_lrmd.0.crash
>>
>> Here is the crm_report from storage0 from this time period:
>> http://sources.xes-inc.com/downloads/pengine-report.tar.bz2
>
> Are you sure?
> The pengine crashed on "Feb 11 17:05:15" but the report appears to be
> from "Tue Feb 12 09:59:50 EST 2013" to "Tue Feb 12 10:30:10 EST 2013"
>
> There was one crash in there, but it was of the lrmd.
> Unfortunately it looks like the binaries and libraries have been stripped.
>
> Where did you get them from? Do you know how to install the -debug packages?
This link has some useful info:
https://wiki.ubuntu.com/DebuggingProgramCrash#Debug_Symbol_Packages
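
For anyone following the "grep for 19357" step from earlier in the thread, here is a rough Python sketch of the same idea. It only illustrates the approach and is not part of Pacemaker: the helper names lines_mentioning_pid and signal_name are made up for this example, and the sample text is three lines copied from the log excerpt quoted above rather than the real /var/log/corosync.log. It matches the PID as a whole number, so 19357 does not also pick up something like 193570, and it decodes the signal number with Python's standard signal module.

```python
import re
import signal

# Three lines copied from the log excerpt quoted in this thread
# (a stand-in for the real /var/log/corosync.log).
SAMPLE_LOG = """\
Feb 11 17:05:15 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
Feb 11 17:05:16 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Respawning failed child process: pengine
Feb 11 17:05:16 storage0 crmd[19358]: error: crm_ipc_read: Connection to pengine failed
"""


def lines_mentioning_pid(text, pid):
    """Return the lines of text that contain pid as a standalone number.

    Hypothetical helper: the lookarounds stop 19357 from matching
    inside a longer digit run such as 193570.
    """
    pattern = re.compile(r"(?<!\d)" + str(pid) + r"(?!\d)")
    return [line for line in text.splitlines() if pattern.search(line)]


def signal_name(signum):
    """Map a signal number to its symbolic name on the current platform."""
    return signal.Signals(signum).name


matches = lines_mentioning_pid(SAMPLE_LOG, 19357)
for line in matches:
    print(line)
print(signal_name(6))
```

On Linux, signal 6 is SIGABRT, which is what abort() raises and what a failed assert() ends up calling, hence the "assertion failure" reading above. To run this against a real log, replace SAMPLE_LOG with the contents of the file, for example open("/var/log/corosync.log").read().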