On 2013-12-27 02:06, Matthias Eble wrote:
1) have a feature to monitor per metric rather than per check_command.
    * Today, many plugins check lots of things.
    * Typical example is check_disk, check_snmp.
    * Depending on the configuration method, acknowledging a problem with /mytmpmount also disables notifications for /var
      * To fix that, we'd need to create a stricter plugin output standard that contains per-metric status codes.
         * metrics would be /mytmpmount-freespace, TCP-response-time, http-status-code or http-match-string
      * The core would need to create sub-services at run-time and populate their results.
      * Benefit: per metric actions and logging. Especially per metric downtime and acknowledgements
      * Maybe it could also be used for receiving snmp traps or log pattern matching checks
         * different alerting for different patterns/traps

    * Today, many folks wrap the plugin call and submit results to a passive check.
       * works, but all possible services need to be in the config.
       * That's where folks start generating nagios configs and reload the daemon.
       * Is that what we want? Maybe?
          * problems arise when there are syntax problems

    * raw proposal:
   define service {
       ...
       check_command  check_disk
       contact_group  os_admins
       define metric {
           metric_name    ^/oracle.*
           contact_group  oracle_admins
       }
   }

    * maybe another layer could be added for check_multi-like plugins.
       * but they could also be forced to structure metric names

I've been thinking about plugins and plugin architecture a bit.

The nagiosplugins project is talking about a new threshold format - https://www.nagios-plugins.org/doc/new-threshold-syntax.html - to achieve the same thing you want to solve in-core.

I think the nagiosplugins approach - basically, update all plugins to support a much more complex (though easier to understand) threshold format, because the old one was too complicated - is wrong. Programmers write buggy code, and telling programmers to write more code leads to more bugs (or at least I write buggy code, and I'm too stupid to write plugins already - that's why I stick to the core :P )

But I'm also not sure how far into the core I'd want to put it. What if, instead of changing either the core or the plugins, we write a plugin wrapper that takes a threshold as described by nagiosplugins and a plugin command line? It would simply parse the perfdata from the plugin and the threshold from the CLI, throw away the plugin exit code, and send a new, "improved" exit code and stdout to naemon.
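
A rough sketch of what such a wrapper could look like, in Python. This is only an illustration under some loud assumptions: the threshold is a simplified `label,warn,crit` triple (a value at or above a limit trips it), not the actual syntax from the nagiosplugins proposal; only the first line of plugin output is inspected; and every function name here is made up for the example.

```python
import re
import shlex
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}
# Ranking used to pick the worst state (assumption: UNKNOWN ranks below WARNING).
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}

# Matches label=value at the start of a perfdata token; the unit and the
# plugin's own ;warn;crit;min;max fields are ignored on purpose.
PERF_RE = re.compile(r"^(?P<label>[^=]+)=(?P<value>-?\d+(?:\.\d+)?)")


def first_line(output):
    return output.splitlines()[0] if output.strip() else ""


def parse_perfdata(output):
    """Return {label: value} from the perfdata on the first output line."""
    _, _, perf = first_line(output).partition("|")
    metrics = {}
    try:
        tokens = shlex.split(perf)  # handles 'quoted labels'=1
    except ValueError:              # unbalanced quote: treat as no perfdata
        return metrics
    for token in tokens:
        match = PERF_RE.match(token)
        if match:
            metrics[match.group("label")] = float(match.group("value"))
    return metrics


def parse_threshold(spec):
    """Parse the simplified (hypothetical) 'label,warn,crit' threshold."""
    label, warn, crit = spec.rsplit(",", 2)
    return label, float(warn), float(crit)


def evaluate(metrics, thresholds):
    """thresholds is {label: (warn, crit)}; returns (state, summary)."""
    worst, notes = OK, []
    for label, (warn, crit) in thresholds.items():
        value = metrics.get(label)
        if value is None:
            state, note = UNKNOWN, f"{label}: no perfdata"
        elif value >= crit:
            state, note = CRITICAL, f"{label}={value:g} >= {crit:g}"
        elif value >= warn:
            state, note = WARNING, f"{label}={value:g} >= {warn:g}"
        else:
            state, note = OK, ""
        if note:
            notes.append(note)
        if SEVERITY[state] > SEVERITY[worst]:
            worst = state
    return worst, ", ".join(notes) or "all metrics within thresholds"


def wrap(thresholds, command):
    """Run the plugin, discard its exit code, derive a new state from perfdata."""
    result = subprocess.run(command, capture_output=True, text=True)
    state, summary = evaluate(parse_perfdata(result.stdout), thresholds)
    _, sep, perf = first_line(result.stdout).partition("|")
    text = f"{STATE_NAMES[state]} - {summary}"
    if sep:
        text += " |" + perf  # pass the original perfdata through unchanged
    return state, text


def main(argv):
    """argv: THRESHOLD [THRESHOLD ...] -- PLUGIN [PLUGIN_ARGS ...]"""
    split = argv.index("--")
    thresholds = {}
    for spec in argv[:split]:
        label, warn, crit = parse_threshold(spec)
        thresholds[label] = (warn, crit)
    state, text = wrap(thresholds, argv[split + 1:])
    print(text)
    return state
```

To use it as a check command, the script would end with `sys.exit(main(sys.argv[1:]))` and be called with the thresholds, a `--`, and then the real plugin command line. Note that nothing here gives you per-metric acknowledgements or downtime: the wrapper still reports one state per service.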

I feel this plugin wrapper approach would take the least amount of work to implement. Which problems would it leave unsolved?

> [snip]
What do you think? What's the focus of the dev-team?

So far, it seems the focus is mostly on cleanup. There's just *so* *much* ancient *crap* lying around. Tens of thousands of lines of code to create an ugly, useless web UI, which forces me to ifdef every second line - what? Three (or so, I lost count) different configuration parsers for near-identical-yet-subtly-different configuration file formats - really? And then the amount of special casing for things that you might expect to behave similarly until you find out the hard way that they really don't - for instance, I always thought the flapping calculation was based on the last 20 (or so) check results, but nooo: https://github.com/naemon/naemon-core/blob/master/naemon/flapping.c#L116 I think the current score is something like -120k lines compared to the initial code import, but there's a lot more we could do.

Oh, and testing. One of the scariest things when starting on a new codebase is realizing that there are no tests at all. The only thing worse than that is finding a directory full of tests - granted, all covered in cobwebs and dust - and you think (or hope, or whatever you call that feeling when you know you must never assume good things but still want to) that you might have found The Book of Shadows in the attic, but after dusting the code off and flipping through it, it turns out that nobody has executed (or even compiled) any code here for *years*, and half the tests test features that don't even exist anymore. It dawns on you that somebody spent days - weeks, even - writing tests to avoid regressions - and then didn't run the tests and thus didn't catch the regressions. See: t-tap, where a few of the test files actually work, and none of them have a working build system ATM.

As far as I'm going to go in terms of longer-term vision and the-way-to-go-iness, I'd like to modularize the crap out of the core. The nagios "core" is anything but, as explained above. It would be neat to lift out a bunch of nagios functionality into a bundle of preinstalled modules. This would serve two purposes: it would force us to dogfood the broker API and thus help us improve it, and it would compartmentalize features (new, and old) to avoid weird interactions with other features.

The broker API as it exists is terrible - you're just given all of the naemon internals, spotty and inconsistent hooks, and a "good luck". This means that, as a core developer, any change I make at all is bound to break some module, while as a module author, I need to learn all of the core to write a module. And you want to store your own add-on configuration/data? Hah! So, in the end, it's just easier to become a core contributor, because who has the time not to?

What would happen if, to take an example that sounds weird but makes some kind of sense, the flapping functionality was a module? That would require some extra module functionality - modules would have to be able to add configuration statements to the config (global and per-object) for configuring flapping thresholds, and modules would have to be able to couple state (is_flapping, last 20 check results) with the object and have it persist between restarts. Now, what if this was the easiest, most concise, and easiest-to-find-out-how way to do it?
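
To make that a bit more concrete, here is a toy sketch in Python of what "a module owns its config keys and its per-object state" could mean. Everything in it is hypothetical: `ModuleRegistry` stands in for whatever the core would provide, no such API exists in naemon today, and the flapping math is the naive "share of state changes in the last 20 results" version, which is not what the real flapping.c does.

```python
import json


class ModuleRegistry:
    """Hypothetical core-side service: config keys and per-object state per module."""

    def __init__(self):
        self.config_defaults = {}  # module name -> {key: default value}
        self.state = {}            # module name -> {object name: state dict}

    def register_config(self, module, key, default):
        self.config_defaults.setdefault(module, {})[key] = default

    def get_config(self, module, key, overrides=None):
        # Per-object overrides win over the global default.
        if overrides and key in overrides:
            return overrides[key]
        return self.config_defaults[module][key]

    def object_state(self, module, object_name):
        return self.state.setdefault(module, {}).setdefault(object_name, {})

    def dump(self):
        """Serialize all module state, e.g. to a retention file at shutdown."""
        return json.dumps(self.state)

    def load(self, text):
        """Restore module state after a restart."""
        self.state = json.loads(text)


class FlappingModule:
    NAME = "flapping"
    HISTORY = 20  # naive assumption: only the last 20 results matter

    def __init__(self, registry):
        self.registry = registry
        registry.register_config(self.NAME, "high_threshold", 50.0)
        registry.register_config(self.NAME, "low_threshold", 25.0)

    def on_check_result(self, object_name, state_code, overrides=None):
        """Record a result and return whether the object is now flapping."""
        st = self.registry.object_state(self.NAME, object_name)
        st.setdefault("is_flapping", False)
        history = st.setdefault("history", [])
        history.append(state_code)
        del history[:-self.HISTORY]  # keep only the newest HISTORY entries
        changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
        percent = 100.0 * changes / max(len(history) - 1, 1)
        high = self.registry.get_config(self.NAME, "high_threshold", overrides)
        low = self.registry.get_config(self.NAME, "low_threshold", overrides)
        if percent >= high:
            st["is_flapping"] = True
        elif percent <= low:
            st["is_flapping"] = False
        return st["is_flapping"]
```

The point is not the arithmetic but the shape: the module declares its own thresholds, keeps `is_flapping` and the history attached to the object, and the registry is the only thing that knows how to persist it.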

I think a module should be able to do all these things - and if it could do that, and if flapping was a module, I would not ever again have to worry about flapping in the remaining core, nor would I wonder where all special cases for flapping are handled - heck, I could even see if the flapping feature has tests and how extensive they are, just from looking at github.com/naemon/flapping ! Today, almost all features - including flapping - are handled by the pair of ogres known as handle_async_service_check_result/handle_async_host_check_result - looking at the code, I have no idea what it will actually end up doing for each case, but I'm quite sure a few of the code paths are buggy - because that many untested if conditions just aren't going to all be correct. Modularizing away the if statements (all of them, all over the core) should render a more consistent, less buggy monitoring solution.

tl;dr: naemon should allow contributors to write modules that are much more powerful than today's broker modules, to make it possible and easy to write a module to add seemingly built-in functionality, like metrics and exceptions - then, we could start to write such modules, go crazy, and see what comes out!
