On Thu, Feb 14, 2013 at 4:28 AM, Andrew Martin <amar...@xes-inc.com> wrote:
> ----- Original Message -----
>> From: "Andrew Beekhof" <and...@beekhof.net>
>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> Sent: Tuesday, February 12, 2013 10:52:23 PM
>> Subject: Re: [Pacemaker] Reason for cluster resource migration
>>
>> On Wed, Feb 13, 2013 at 2:04 AM, Andrew Martin <amar...@xes-inc.com> wrote:
>> > ----- Original Message -----
>> >> From: "Andrew Beekhof" <and...@beekhof.net>
>> >> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> >> Sent: Monday, February 11, 2013 10:11:53 PM
>> >> Subject: Re: [Pacemaker] Reason for cluster resource migration
>> >>
>> >> On Tue, Feb 12, 2013 at 3:07 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>> >> > On Tue, Feb 12, 2013 at 3:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>> >> >> On Tue, Feb 12, 2013 at 1:40 PM, Andrew Martin <amar...@xes-inc.com> wrote:
>> >> >>> Hello,
>> >> >>>
>> >> >>> Unfortunately this same failure occurred again tonight,
>> >> >>
>> >> >> It might be the same effect, but there was no indication that the PE
>> >> >> died last time.
>> >> >>
>> >> >>> taking down a production cluster.
>> >> >>> Here is the part of the log where pengine died:
>> >> >>> Feb 11 17:05:15 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Child process pengine terminated with signal 6 (pid=19357, core=128)
>> >> >>> Feb 11 17:05:16 storage0 pacemakerd[1572]: notice: pcmk_child_exit: Respawning failed child process: pengine
>> >> >>> Feb 11 17:05:16 storage0 pengine[12660]: notice: crm_add_logfile: Additional logging available in /var/log/corosync.log
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: crm_ipc_read: Connection to pengine failed
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: mainloop_gio_callback: Connection to pengine[0x891680] closed (I/O condition=25)
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: crit: pe_ipc_destroy: Connection to the Policy Engine failed (pid=-1, uuid=c9aef461-386c-4e4f-b509-0c9c8d80409b)
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: notice: save_cib_contents: Saved CIB contents after PE crash to /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_ERROR from save_cib_contents() received in state S_POLICY_ENGINE
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=save_cib_contents ]
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: warning: do_election_vote: Not voting in election, we're in state S_RECOVERY
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: notice: terminate_cs_connection: Disconnecting from Corosync
>> >> >>> Feb 11 17:05:16 storage0 crmd[19358]: error: do_exit: Could not recover from internal error
>> >> >>>
>> >> >>> The rest of the log:
>> >> >>> http://sources.xes-inc.com/downloads/pengine.log
>> >> >>> Looking through the full log, it seems that pengine recovers,
>> >> >>
>> >> >> Right, pacemakerd watches for this and restarts it.
>> >> >>
>> >> >>> but perhaps not quickly enough to prevent the STONITH and resource migration?
>> >> >>
>> >> >> Highly likely.
>> >> >> However the PE crashing is quite serious. I'd like to get to the bottom of that ASAP.
>> >> >>
>> >> >>>
>> >> >>> Here is the pe-core dump file mentioned in the log:
>> >> >>> http://sources.xes-inc.com/downloads/pe-core.bz2
>> >> >>
>> >> >> Unfortunately core files are specific to the machine that generated them.
>> >> >> If you create a crm_report for about that time, it will open it and
>> >> >> record a backtrace for us to look at.
>> >> >>
>> >> >> Also very important is the contents of:
>> >> >>
>> >> >> /var/lib/pacemaker/pengine/pe-core-c9aef461-386c-4e4f-b509-0c9c8d80409b.bz2
>> >> >
>> >> > Ohhh, that's what the pe-core link was.
>> >> > I've run it through crm_simulate but couldn't reproduce the crash.
>> >> >
>> >> > So we'll still need the crm_report, it will have more detail on the
>> >> > "Child process pengine terminated with signal 6 (pid=19357, core=128)"
>> >> > part.
>> >>
>> >> Signal 6 is an assertion failure, but strangely there is no mention of
>> >> one in syslog.
>> >> Can you grep /var/log/corosync.log for lines containing 19357 please?
>> >>
>> > Andrew,
>> >
>> > Thanks for the help. Here are the lines containing 19357:
>> > http://sources.xes-inc.com/downloads/19357.log
>> > cl_sysadmin_notify is a clone of an ocf:heartbeat:MailTo resource. Postfix
>> > is installed and running, so I am not sure why these failures are occurring.
>> >
>> >> > The core file will likely be somewhere under /var/lib/pacemaker/cores
>> > That directory doesn't exist on this server, and it doesn't appear
>> > to be in /var/crash either:
>>
>> It looks like /var/lib/heartbeat/cores/ on your system.
>>
>> > # ls /var/crash/ -ltr
>> > total 67548
>> > -rw-r----- 1 hacluster whoopsie 1293711 Feb 6 10:01 _usr_libexec_pacemaker_pengine.110.crash
>> > ---------- 1 root whoopsie 67874816 Feb 11 17:07 _usr_libexec_pacemaker_lrmd.0.crash
>> > In case they would be helpful, here are those two files:
>> > http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_pengine.110.crash
>> > http://sources.xes-inc.com/downloads/_usr_libexec_pacemaker_lrmd.0.crash
>> >
>> > Here is the crm_report from storage0 from this time period:
>> > http://sources.xes-inc.com/downloads/pengine-report.tar.bz2
>>
>> Are you sure?
>> The pengine crashed on "Feb 11 17:05:15" but the report appears to be
>> from "Tue Feb 12 09:59:50 EST 2013" to "Tue Feb 12 10:30:10 EST 2013"
>>
>> There was one crash in there, but it was of the lrmd.
>> Unfortunately it looks like the binaries and libraries have been stripped.
>>
>> Where did you get them from? Do you know how to install the -debug packages?
>
> Andrew,
>
> I ran crm_report again as follows:
> # crm_report -f "2013-02-11 17:00:00" -t "2013-02-11 17:30:00" \
> -n "storage0 storage1 storagequorum" -C /tmp/report
> ...
> storage0: Collecting data from storage0 storage1 storagequorum (02/11/2013 05:00:00 PM to 02/11/2013 05:30:00 PM)
> ...
> storage1: Found core file: -rw-r----- 1 root root 18485248 Feb 11 17:10 /var/lib/heartbeat/cores/root/core.7678
>
>
> Here is the report it generated:
> http://sources.xes-inc.com/downloads/storage-report.bz2
>
>
> I created these packages with checkinstall (using the normal Pacemaker
> build process, but substituting checkinstall for "make install"). By
> default it strips debugging information when generating the package,
> which I thought was desirable for a production environment.

Oh it is... right up until the point anything crashes :)
Which should never happen of course, but this is why distros often
ship a "debug" package with the stripped out symbols - they can be
installed afterwards if anything goes wrong.

> I also
> have a debug version of the package, which I will install now. I am
> also working to build Ubuntu packages more officially using
> dpkg-buildpackage. Is there a better way to create these packages?

I don't have a lot of expertise with Debian-based distros, perhaps
someone else can suggest one...

> I
> would prefer to not have to install build tools and compile the source
> directly on production servers.
>
> Thanks,
>
> Andrew

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
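
For anyone following along, here is a rough sketch of the two checks asked for earlier in the thread: naming signal 6 and pulling the lines for PID 19357 out of the log. The helper name lines_for_pid is made up for illustration and is not part of Pacemaker; the PID and log path are the ones quoted above. The pattern matches the PID only as a whole number, so a longer number such as 119357 is not picked up.

```shell
# Name the signal from "terminated with signal 6".
# SIGABRT is what abort() raises, for example after a failed assertion.
kill -l 6    # prints ABRT on Linux

# Hypothetical helper (not part of Pacemaker): print the lines of a log
# file that mention a PID as a whole number, e.g. "pengine[19357]:" or
# "pid=19357,", without also matching longer numbers such as 119357.
lines_for_pid() {
    pid="$1"
    logfile="$2"
    grep -E "(^|[^0-9])${pid}([^0-9]|\$)" "$logfile"
}

# Usage for the crash discussed in this thread:
#   lines_for_pid 19357 /var/log/corosync.log
```

A plain "grep 19357" would do the same job on most logs; the stricter pattern only matters when another number in the file happens to contain the PID.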