Re: [naemon-dev] Ideas about future features

Robin Sonefors Fri, 27 Dec 2013 15:23:40 -0800

On 2013-12-27 02:06, Matthias Eble wrote:

1) have a feature to monitor per metric rather than per check_command.
    * Today, many plugins check lots of things.
    * Typical example is check_disk, check_snmp.
    * Depending on the configuration method, acknowledging a problem
with /mytmpmount also disables notifications for /var
      * To fix that, we'd need to create a stricter plugin output
standard that contains per-metric status codes.
         * metrics would be /mytmpmount-freespace, TCP-response-time,
http-status-code or http-match-string
      * The core would need to create sub-services at run-time and
populate their results.
      * Benefit: per metric actions and logging. Especially per metric
downtime and acknowledgements
      * Maybe it could also be used for receiving snmp traps or log
pattern matching checks
         * different alerting for different patterns/traps


    * Today, many folks wrap the plugin call and submit results to a
passive check.
       * works, but all possible services need to be in the config.
       * That's where folks start generating nagios configs and reload
the daemon.
       * Is that what we want? Maybe?
          * problems arise when there are syntax problems

    * raw proposal:
   define service {
       ...
       check_command  check_disk
       contact_group  os_admins
       define metric {
           metric_name    ^/oracle.*
           contact_group  oracle_admins
       }
   }

    * maybe another layer could be added for check_multi-like plugins.
       * but they could also be forced to structure metric names
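
To make the raw proposal a bit more concrete, here is a rough sketch of how a per-metric override could be resolved. It is only an illustration: the dict layout and the contact_group_for helper are made up for this example, and nothing like it exists in the core today. The only things taken from the proposal are the service-level contact_group, the metric_name pattern and the per-metric contact_group.

```python
import re

# Hypothetical in-memory form of the proposed "define service" block.
service = {
    "check_command": "check_disk",
    "contact_group": "os_admins",
    "metrics": [
        {"metric_name": r"^/oracle.*", "contact_group": "oracle_admins"},
    ],
}


def contact_group_for(service, metric_name):
    """Return the contact group for one metric of a service.

    The first "define metric" entry whose pattern matches wins;
    otherwise the service-level contact group applies.
    """
    for metric in service["metrics"]:
        if re.search(metric["metric_name"], metric_name):
            return metric["contact_group"]
    return service["contact_group"]


print(contact_group_for(service, "/oracle/data-freespace"))
print(contact_group_for(service, "/mytmpmount-freespace"))
```

First match wins here simply because it is the easiest rule to explain; a real implementation would have to decide what happens when several patterns match the same metric.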


I've been thinking about plugins and plugin architecture a bit.

The nagiosplugins project is talking about a new threshold format - https://www.nagios-plugins.org/doc/new-threshold-syntax.html - to achieve the same thing you want to solve in-core.

I think the nagiosplugins approach - basically, update all plugins to support a much more complex (though easier to understand) threshold format, because the old one was too complicated - is wrong. Programmers write buggy code, and telling programmers to write more code leads to more bugs (or at least I write buggy code, and I'm too stupid to write plugins already - that's why I stick to the core :P )

But I'm also not sure how far into the core I'd want to put it. What if, instead of changing either the core or the plugins, we write a plugin wrapper that takes a threshold as described by nagiosplugins and a plugin command line? It would simply parse the perfdata from the plugin and the threshold from the CLI, throw away the plugin exit code, and send a new, "improved" exit code and stdout to naemon.
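
A minimal sketch of such a wrapper, to show how little is needed. Big assumptions: the "label:warn:crit" threshold form is a simplified stand-in made up for this example (it is not the nagiosplugins syntax linked above), "value >= threshold" is the only comparison supported, and all function names are hypothetical. Only the label and the numeric value of each perfdata item are used.

```python
import re
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}
SEVERITY = [OK, WARNING, UNKNOWN, CRITICAL]  # least to most severe

# Perfdata items look like 'label'=value[UOM];warn;crit;min;max
PERFDATA_RE = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)")


def run_plugin(command):
    """Run the wrapped plugin and return its stdout; the exit code is ignored."""
    return subprocess.run(command, capture_output=True, text=True).stdout


def parse_perfdata(plugin_output):
    """Return {label: value} from the perfdata on the first output line."""
    lines = plugin_output.splitlines()
    first_line = lines[0] if lines else ""
    _, _, perf = first_line.partition("|")
    return {quoted or bare: float(value)
            for quoted, bare, value in PERFDATA_RE.findall(perf)}


def parse_threshold(spec):
    """Parse the hypothetical 'label:warn:crit' form into a tuple."""
    label, warn, crit = spec.rsplit(":", 2)
    return label, float(warn), float(crit)


def evaluate(perfdata, thresholds):
    """Return (worst_state, {label: state}) for the given thresholds."""
    per_metric = {}
    for label, warn, crit in thresholds:
        if label not in perfdata:
            per_metric[label] = UNKNOWN
        elif perfdata[label] >= crit:
            per_metric[label] = CRITICAL
        elif perfdata[label] >= warn:
            per_metric[label] = WARNING
        else:
            per_metric[label] = OK
    worst = max(per_metric.values(), key=SEVERITY.index, default=UNKNOWN)
    return worst, per_metric


def format_output(state, per_metric, plugin_output):
    """Build the new first line, keeping the original perfdata."""
    lines = plugin_output.splitlines()
    _, _, perf = (lines[0] if lines else "").partition("|")
    summary = ", ".join(f"{label} is {STATE_NAMES[s]}" for label, s in per_metric.items())
    return f"{STATE_NAMES[state]} - {summary} |{perf}"


# Demo on canned output; a real wrapper would call run_plugin() on the plugin
# command line and exit with the computed state.
sample = "DISK OK | '/var'=80%;90;95 /mytmpmount=97%;90;95"
specs = [parse_threshold("/var:90:95"), parse_threshold("/mytmpmount:90:95")]
state, per_metric = evaluate(parse_perfdata(sample), specs)
print(format_output(state, per_metric, sample))
```

Note that this still collapses everything into a single exit code, so per-metric acknowledgements and downtime are not covered by the wrapper alone.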

I feel this plugin wrapper approach would take the least amount of work to implement. Which problems would it leave unsolved?


> [snip]

What do you think? What's the focus of the dev-team?

So far, it seems the focus is mostly on cleanup. There's just *so* *much* ancient *crap* lying around. Tens of thousands of lines of code to create an ugly, useless web UI, which force me to ifdef every second line - what? Three (or so, I lost count) different configuration parsers for near-identical-yet-subtly-different configuration file formats - really? And then the amount of special casing for things that you might expect to behave similarly until you find out the hard way that they really don't - for instance, I always thought the flapping calculation was based on the last 20 (or so) check results, but nooo:
https://github.com/naemon/naemon-core/blob/master/naemon/flapping.c#L116
I think the current score is something like -120k lines compared to the initial code import, but there's a lot more we could do.

Oh, and testing. One of the scariest things when starting on a new codebase is realizing that there are no tests at all. The only thing worse than that is finding a directory full of tests - granted, all covered in cobweb and dust - and you think (or hope, or whatever you call that feeling when you know you must never assume good things but still want to) that you might have found The Book of Shadows in the attic, but after dusting the code off and flipping through it, it turns out that nobody has executed (or even compiled) any code here for *years*, and half the tests test features that don't even exist anymore. It dawns on you that somebody spent days - weeks, even - writing tests to avoid regressions - and then didn't run the tests and thus didn't catch the regressions. See: t-tap, where a few of the test files actually work, and none of them have a working build system ATM.

As far as I'm going to go in terms of longer-term vision and the-way-to-go-iness, I'd like to modularize the crap out of the core. The nagios "core" is anything but, as explained above. It would be neat to lift out a bunch of nagios functionality into a bundle of preinstalled modules. This would serve two purposes: it would force us to dogfood the broker API and thus help us improve it, and it would compartmentalize features (new, and old) to avoid weird interactions with other features.

The broker API as it exists is terrible - you're just given all of the naemon internals, spotty and inconsistent hooks, and a "good luck". This means that, as a core developer, any change I make at all is bound to break some module, while as a module author, I need to learn all of the core to write a module. And you want to store your own add-on configuration/data? Hah! So, in the end, it's just easier to become a core contributor, because who has the time not to?

What would happen if, to take an example that sounds weird but makes some kind of sense, the flapping functionality was a module? That would require some extra module functionality - modules would have to be able to add configuration statements to the config (global and per-object) for configuring flapping thresholds, and modules would have to be able to couple state (is_flapping, last 20 check results) with the object and have it persist between restarts. Now, what if this was the easiest, most concise, and easiest-to-find-out-how way to do it?
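
To give a feel for what such a module might look like, here is a sketch in Python. Everything in it is hypothetical: no such module API exists (today's broker modules are C shared objects), the class and method names are invented, the threshold defaults are arbitrary, and the calculation is the naive "state changes over the last 20 check results" version, not what flapping.c actually does. It only shows the three ingredients: config statements declared by the module, state coupled to an object, and state that survives a restart.

```python
import json
from collections import deque


class FlappingModule:
    """Hypothetical module owning all flapping config and state."""

    # Config statements the module would add to the config (arbitrary defaults).
    config_defaults = {"low_flap_threshold": 20.0, "high_flap_threshold": 30.0}
    HISTORY = 20  # number of check results remembered per object

    def __init__(self, config=None):
        self.config = {**self.config_defaults, **(config or {})}
        self.objects = {}  # object name -> {"history": deque, "is_flapping": bool}

    def _object_state(self, name):
        return self.objects.setdefault(
            name, {"history": deque(maxlen=self.HISTORY), "is_flapping": False})

    def on_check_result(self, name, check_state):
        """Hypothetical hook, called by the core once per check result."""
        obj = self._object_state(name)
        obj["history"].append(check_state)
        history = list(obj["history"])
        changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
        percent = 100.0 * changes / (self.HISTORY - 1)
        # Hysteresis: between the two thresholds, the previous verdict stands.
        if percent >= self.config["high_flap_threshold"]:
            obj["is_flapping"] = True
        elif percent <= self.config["low_flap_threshold"]:
            obj["is_flapping"] = False
        return obj["is_flapping"]

    def dump_state(self):
        """Serialize per-object state so it can persist between restarts."""
        return json.dumps({name: {"history": list(obj["history"]),
                                  "is_flapping": obj["is_flapping"]}
                           for name, obj in self.objects.items()})

    def load_state(self, text):
        """Restore per-object state saved by dump_state()."""
        for name, obj in json.loads(text).items():
            self.objects[name] = {
                "history": deque(obj["history"], maxlen=self.HISTORY),
                "is_flapping": obj["is_flapping"],
            }


module = FlappingModule()
for i in range(20):
    module.on_check_result("web01;HTTP", 0 if i % 2 == 0 else 2)
saved = module.dump_state()

restarted = FlappingModule()
restarted.load_state(saved)
print(restarted.objects["web01;HTTP"]["is_flapping"])
```

The point is that the core would never need to know what is_flapping means; it would only hand check results to the hook and store the opaque state for the module.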

I think a module should be able to do all these things - and if it could do that, and if flapping was a module, I would not ever again have to worry about flapping in the remaining core, nor would I wonder where all special cases for flapping are handled - heck, I could even see if the flapping feature has tests and how extensive they are, just from looking at github.com/naemon/flapping ! Today, almost all features - including flapping - are handled by the pair of ogres known as handle_async_service_check_result/handle_async_host_check_result - looking at the code, I have no idea what it will actually end up doing for each case, but I'm quite sure a few of the code paths are buggy - because that many untested if conditions just aren't going to all be correct. Modularizing away the if statements (all of them, all over the core) should render a more consistent, less buggy monitoring solution.

tl;dr: naemon should allow contributors to write modules that are much more powerful than today's broker modules, to make it possible and easy to write a module to add seemingly built-in functionality, like metrics and exceptions - then, we could start to write such modules, go crazy, and see what comes out!
