Thanks a lot for the help, Michael! In the end it turned out that the number of service groups we had was the cause of the huge startup time. We realized we could live without them, and now the whole reload process takes about 20 seconds. This made the reload issue that I originally wrote about in this thread practically go away. I'm not sure whether it is now just extremely rare or really gone, but we haven't seen the issue for a month and we're reloading about once every hour.

On Wed, Oct 8, 2014 at 8:09 PM, Michael Friedrich <[email protected]> wrote:

> On 08.10.2014 at 11:41, Zsolt Dollenstein wrote:
>
>> On Tue, Oct 7, 2014 at 6:00 PM, Michael Friedrich
>> <[email protected]> wrote:
>>
>>     Hi,
>>
>>     On 01.10.2014 at 10:55, Zsolt Dollenstein wrote:
>>
>>>     Hi,
>>>
>>>     We at Prezi are trying to migrate over to icinga2 and we've hit
>>>     what seems like a showstopper for us. We've spent about 2 days
>>>     trying to debug the issue to no avail, so any pointers are welcome.
>>
>>     Which version of Icinga 2, and how was Icinga 2 installed on which
>>     distribution?
>>
>> We are running off of the current master (with these
>> <https://github.com/prezi/icinga2/compare/prezi-release> changes) on
>> ubuntu. We built icinga2 with the debian packaging mechanisms in the
>> repo (using dpkg-buildpackage).
>
> Uhm. Keep in mind that the master branch is used for current
> development towards the 2.2 feature milestone. Today the cli commands
> base has been merged which introduces certain changes.
>
> If I were you, I would go for support/2.1 and build based upon that, and
> only switch to master if this is really just a playground and developers
> demand you to test something.
>
>>>     In short, the issue is this: sometimes when we reload our icinga2
>>>     config (via SIGHUP), both the new and old icinga2 processes stop
>>>     working. This happens about once every 4 reloads.
>>
>>     What comes to mind: Try strace and/or gdb attaching the 2
>>     processes and trace their actions after sending a SIGHUP signal.
>>
>>     http://docs.icinga.org/icinga2/latest/doc/module/icinga2/chapter/troubleshooting#debug
>>
>> Thanks, we haven't tried gdb yet, will give it a shot soon.
>> Strace was not terribly helpful because of the amount of active checks
>> (maybe we should try without tracing forks and just attach it to the
>> two processes).
>
> Hmmm, yeah, I'd only look into what the "parent" process is doing before
> it stops its operation.
>
> What comes to mind - when the old parent process receives the
> termination signal, it stores its state data into the icinga2.state
> file. Once the process has terminated successfully, the child process
> takes over and re-reads the state file to ensure that the history is
> without loss after doing config validation. Might be some sort of race
> condition over here, but that's just a blind guess.
>
> The current git master moves the daemon code into the cli subcommand, so
> you'll currently find that below lib/cli/daemoncommand.cpp (could be
> changed in the next weeks though).
>
> For compiling - the package build uses the release flag with debug
> symbols. You might want to recompile in debug mode to get more
> information too. CMAKE_BUILD_TYPE=Debug is what you're looking for.
>
> Below is an excerpt of my bashrc I'm using for building different types
> of Icinga 2.
>
> export CMAKE_OPTS_DEBUG="-DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_SYSCONFDIR=/etc -DCMAKE_INSTALL_LOCALSTATEDIR=/var -DCMAKE_BUILD_TYPE=Debug -DICINGA2_USER=icinga -DICINGA2_GROUP=icinga -DICINGA2_COMMAND_USER=icinga -DICINGA2_COMMAND_GROUP=icingacmd"
> export CMAKE_OPTS_NORMAL="-DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_INSTALL_SYSCONFDIR=/etc -DCMAKE_INSTALL_LOCALSTATEDIR=/var -DCMAKE_BUILD_TYPE=RelWithDebInfo -DICINGA2_USER=icinga -DICINGA2_GROUP=icinga -DICINGA2_COMMAND_USER=icinga -DICINGA2_COMMAND_GROUP=icingacmd"
>
> alias icinga2_debug='rm -rf build ; mkdir build ; cd build ; cmake $CMAKE_OPTS_DEBUG .. ; sudo make -j8 install ; cd ..'
> alias icinga2_normal='rm -rf build ; mkdir build ; cd build ; cmake $CMAKE_OPTS_NORMAL .. ; sudo make -j8 install ; cd ..'
>
>>>     From the logs it looks like the old process thinks all is well
>>>     and is terminating as expected (AFAICT the new process kills it
>>>     properly). I can't find any logs from the new process, not to
>>>     mention any errors/warnings. We have no idea why the new process
>>>     stops. We have tried to turn on debug logging to no avail. We
>>>     even tried patching the code to see more logs from the child
>>>     process, and we were able to verify that it successfully parses
>>>     the configs and proceeds to shut down the parent.
>>
>>     May we see these modifications (git patch)? Maybe there's some
>>     additional logging missing here.
>>
>> Sure, https://github.com/prezi/icinga2/compare/prezi-release
>
> Hmmm. There are some patches on that list which would make sense
> upstream, whilst otherwise you're keeping a local fork (no-one wants
> that). Feel free to open issues with attached patches / pr urls.
>
>> specifically, I meant this:
>> https://github.com/prezi/icinga2/commit/ad90733b67a204754523206e757c48f948ae906a
>
> That looks like one past issue we have when a reload does not log any
> feedback. I'll talk with Gunnar about that tomorrow.
>
>> and another which I haven't bothered to check in (this was to make
>> sure the child's stdout is not swallowed):
>> https://gist.github.com/zsol/00d5bb59b12d48406810
>>
>>>     This is a big problem for us because we have a biggish config
>>>     (about 30K services and 90K Notifications), so starting up (or
>>>     validating the configuration) takes about 5 minutes on a decent
>>>     machine, which means when this scenario happens, we're flying
>>>     blind for that amount of time.
>>
>>     Just curious - what's a "decent machine"? 5 minutes sounds way too
>>     much for that amount of objects.
>>
>> It's a c3.2xlarge type instance on AWS EC2:
>> http://www.ec2instances.info/?filter=c3.2xl
>
> So 8 cores, 16 GB and fast disks.
>
>> Awesome to hear this because we thought it was weird, too :) Maybe
>> I'll find some time to profile config parsing.
>
> Most likely you'll go the Valgrind way, or profile only configuration
> snippets.
>
> https://blog.netways.de/2013/09/05/profiling-mit-gperftools/
>
> Not sure what else may help, but we'll see.
>
> Kind regards,
> Michael
>
>>>     Any pointers are appreciated.
>>>
>>>     [apologies for possibly duplicate emails, I think one copy of
>>>     this is sitting in the moderation queue]
>>
>>     Will remove that later on, no worries.
>>
>>     Kind regards,
>>     Michael
>>
>>     --
>>     Michael Friedrich, DI (FH)
>>     Application Developer
>>
>>     NETWAYS GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
>>     Tel: +49 911 92885-0 | Fax: +49 911 92885-77
>>     GF: Julian Hein, Bernd Erk | AG Nuernberg HRB18461
>>     http://www.netways.de | [email protected]
>>
>>     ** Puppet Camp Duesseldorf 2014 - October - netways.de/puppetcamp **
>>     ** OSMC 2014 - November - netways.de/osmc **
>>     ** OpenNebula Conf 2014 - December - opennebulaconf.com **
>>     ** OSDC 2015 - April - osdc.de **
>>
>> --
>> *Zsolt Dollenstein*
>> Developer at Prezi <http://prezi.com>
>
> --
> Michael Friedrich, DI (FH)
> Application Developer
>
> NETWAYS GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
> Tel: +49 911 92885-0 | Fax: +49 911 92885-77
> GF: Julian Hein, Bernd Erk | AG Nuernberg HRB18461
> http://www.netways.de | [email protected]
>
> ** Puppet Camp Duesseldorf 2014 - October - netways.de/puppetcamp **
> ** OSMC 2014 - November - netways.de/osmc **
> ** OpenNebula Conf 2014 - December - opennebulaconf.com **
> ** OSDC 2015 - April - osdc.de **

--
*Zsolt Dollenstein*
Developer at Prezi <http://prezi.com>
_______________________________________________
icinga-users mailing list
[email protected]
https://lists.icinga.org/mailman/listinfo/icinga-users
