Petr Bena <[email protected]> wrote:

> I think it's time to finally make it possible for users to create
> their own checks for Nagios (Icinga).

> I will try to document how Icinga is currently set up at
> https://wikitech.wikimedia.org/wiki/Icinga/Labs

> Currently there is nagiosbuilder, a Python script written by Damian
> that queries LDAP and builds Nagios cfg files based on that.

> My idea is to create configuration files / templates for this
> nagiosbuilder so that it would apply different options for certain
> hosts based on this configuration. Users would just

> * create their own check, place it somewhere on the server whose
> service they want to monitor, and add it to NRPE
> (/etc/nagios/nrpe.d/yourservice.cfg)
> * use some interface (to be discussed) to register this check for a
> specific host

> and nagiosbuilder would

> * query that interface in order to generate configuration files
> * based on this config, set up custom services for these hosts

> For us (developers) it's easiest to use Gerrit as this interface, so
> that people would directly update the configuration files used by
> nagiosbuilder. However, that pretty much sucks, so I think it would be
> better to create a new interface in labsconsole, so that people can
> define their Nagios checks directly as a property of each node.

> That of course would require more coding and ops assistance but I
> think it's doable. Some opinions?
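
To make the proposed workflow concrete, here is a minimal sketch of the
generation step.  It is not the actual nagiosbuilder code: the function
name, the shape of the `checks` mapping, the `generic-service` template
and the `check_nrpe!<command>` convention are all assumptions made for
illustration.

```python
def render_services(checks):
    """Render Nagios service definitions from {host: [nrpe_command, ...]}.

    `checks` stands in for whatever the interface (to be discussed)
    would return; its shape is hypothetical.
    """
    blocks = []
    for host in sorted(checks):
        for command in checks[host]:
            blocks.append(
                "define service {\n"
                "    use                  generic-service\n"  # assumed template name
                f"    host_name            {host}\n"
                f"    service_description  {command}\n"
                f"    check_command        check_nrpe!{command}\n"  # assumed command definition
                "}\n"
            )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical instance and check names.
    print(render_services({"instance-a": ["check_yourservice"]}))
```

On the monitored instance, the matching NRPE entry in
/etc/nagios/nrpe.d/yourservice.cfg would be a line of the form
`command[check_yourservice]=<path to the check>`.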

a) Great to see some progress on that front :-).

b) I think differences between production and Labs hosts
   should be minimal.  If we have to add (and sync!) checks
   manually, it's gonna be a big mess.  If I configure an
   instance with the class redis, that class's monitoring
   should be used without any further intervention.  (If I
   set up my own puppetmaster and use changes not committed
   to the WMF repository yet, that would be an acceptable
   exception.)
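
A sketch of the idea in b), assuming the builder can see which Puppet
classes an instance has: the checks then follow from the classes and
nobody has to add them by hand.  The class-to-check table and every
name in it are hypothetical.

```python
# Hypothetical mapping from Puppet class to the checks that class implies.
CLASS_CHECKS = {
    "redis": ["check_redis"],
    "apache": ["check_http"],
}


def checks_for_instance(classes, class_checks=CLASS_CHECKS):
    """Return the checks implied by an instance's classes.

    Duplicates are dropped and first-seen order is kept; classes
    without an entry in the table contribute nothing.
    """
    seen = []
    for cls in classes:
        for check in class_checks.get(cls, []):
            if check not in seen:
                seen.append(check)
    return seen
```

An instance configured with the class redis would get check_redis
without any further intervention.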

I assume this is most important for Beta
(cf. https://bugzilla.wikimedia.org/51497), but all other
projects managed by operations/puppet would profit from that
as well.  Comment #2 of that bug refers to security reasons
for why we can't copy the production setup verbatim, but
before we introduce another system, I think we should try to
mitigate those concerns first, i.e. see what users can
fiddle with without review and what impact that has on the
processes run on the puppetmaster and/or Icinga.

What we will need some sort of UI for is the configuration
of alerts.  I assume not every project wants to receive
mails or IRC messages whenever something's broken, and even
within one project there may be different levels of interest.
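
What such a UI would have to store could be as simple as the sketch
below.  The data layout, project names, addresses and channel names
are all made up for illustration; nothing like this exists yet.

```python
# Hypothetical per-project alert preferences, as a UI might store them.
PREFERENCES = {
    "project-a": {"mail": ["admin@example.org"], "irc": []},
    "project-b": {"mail": [], "irc": ["#project-b"]},
}


def alert_targets(project, preferences=PREFERENCES):
    """Return (channel, target) pairs for a project.

    A project with no stored preferences gets an empty list,
    i.e. it receives no alerts.
    """
    prefs = preferences.get(project, {})
    return [
        (channel, target)
        for channel in ("mail", "irc")
        for target in prefs.get(channel, [])
    ]
```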

Tim


_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
