On Sun, Aug 15, 2010 at 09:54:44PM -0600, Christer Edwards wrote:
> Based on our downtime this past evening I took an interest in our
> current monitoring solution (if you could call it that). The details I
> found are listed below, and I think clear up some misconceptions I've
> (we've) had about this box.
>
> signal.gnome.org is, as we know, hosted at OSUDL. It is a 2cpu VM
> (QEMU Virtual CPU version 0.11.1), with 256M RAM and about 7.5G
> storage. Currently it is running nagios3 on apache 1.3 and mysql
> server (a requirement of nagios3?).
>
> The current monitoring configuration is poor and looks like it has
> been for some time. It is only monitoring a handful of services, the
> key services not even configured properly. As an example,
> window.gnome.org HTTP service: down 246d 16h 33m 12s. Most configured
> services are like this. It's mostly red across the board, and I'm sure
> it's simply misconfiguration.

At one time it had both an outdated Nagios and a Nagios3 installed, and
both were running. It kept sending out mails even after we acknowledged
that a machine was permanently down (box.gnome.org), so repeat emails
were disabled, etc. I guess it still has old IP addresses in the
configuration and we didn't notice because of the lack of repeat emails.

> It'll take a little bit of work but it can be cleaned up to provide
> rudimentary monitoring without a lot of work. This is what I'd like to
> do:
>
> 1) update to apache2 (why is it even on apache 1.3??)

Old VM.

> 2) define as a group the critical services we want monitored (I'd
> suggest http for bugzilla and the wiki for starters)
> 3) configure SSL for the signal webserver. Auth is done by htpasswd.
> We all know plain text is bad.
> 4) configure the nagios3 path as the default DocumentRoot. Currently /
> shows some generic message, the wiki points to /nagios/, but the
> actual monitoring is at /nagios3/
> 5) as an extra, perhaps add a DNS cname/alias for 'nagios.gnome.org'
> which points to signal.
> 6) /etc/aliases only defines specific admins as email recipients. I
> think these should be sent team-wide.

That should be kept up to date, yes. It should NOT send stuff out to
gnome-sysadmin, as then we miss out on a lot of downed stuff. It should
only go directly to admins, as otherwise people might rely on the
gnome-sysadmin nagios bits. Not sure how to keep that list up to date.
It is not connected to LDAP, partly for historical reasons and partly
because it might interfere with monitoring.

Would be nice to have an announcement bot in #sysadmin on irc.gnome.org
(and to configure it to repeat the downed machines).

> All of this would take me maybe a couple hours tomorrow. I'm
> interested in any other feedback re: services monitored, notification
> methods (emails to specific sysadmins per-host? emails to -sysadmin?
> emails to -infrastructure?)

Nothing to gnome-sysadmin, nor gnome-infrastructure. It should only send
stuff to addresses outside @gnome.org.
Announcements of downed stuff should not rely on any infrastructure
that might itself have issues.

Would be nice if it sent test emails to itself (signal->menubar->signal),
so we know when there is an email problem (sometimes clamav / amavisd has
issues). In those cases postfix itself works OK; it just doesn't deliver
anything anymore.

> In the meantime I'll get started on some basic maintenance, such as
> fixing the monitoring that is there.

Cool!

-- 
Regards,
Olav
_______________________________________________
gnome-infrastructure mailing list
http://mail.gnome.org/mailman/listinfo/gnome-infrastructure
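
For illustration only, the mail self-test idea above (signal->menubar->signal)
could be sketched roughly like this in Python. Nothing here is an existing
Nagios plugin or anything deployed on signal: the header name, the function
names, the example addresses and the 15-minute threshold are all assumptions
made up for the sketch. It only covers building a uniquely tagged probe mail
and deciding whether it has come back; reading the returned mail out of a
local mailbox is left out.

```python
import smtplib
import uuid
from email.message import EmailMessage

# Hypothetical header used to recognise our own probe mails.
PROBE_HEADER = "X-Mail-Loop-Probe"


def build_probe(sender, recipient, token=None):
    """Build a probe mail carrying a unique token in a custom header."""
    token = token or uuid.uuid4().hex
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"mail loop probe {token}"
    msg[PROBE_HEADER] = token
    msg.set_content("Automated mail round-trip probe; safe to ignore.")
    return token, msg


def send_probe(msg, host="localhost"):
    """Hand the probe to the local MTA (assumed to listen on localhost:25)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)


def probe_returned(token, inbox_messages):
    """True if any received message carries the probe token."""
    return any(m.get(PROBE_HEADER) == token for m in inbox_messages)


def loop_status(token, inbox_messages, sent_at, now, max_delay=900):
    """Classify the round trip; times are in seconds.

    max_delay (15 minutes here) is an arbitrary assumption.
    """
    if probe_returned(token, inbox_messages):
        return "OK"
    if now - sent_at > max_delay:
        return "CRITICAL"
    return "PENDING"
```

The point of the token is that an accepting postfix is not enough: the status
only turns OK once the same message has actually been delivered back, which is
the case that breaks when clamav / amavisd stops passing mail on.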
