Regarding your last point: it looks like mail is already sent in this case,
to the members of the following contact group:
define contactgroup {
    contactgroup_name    analytics
    members              dvanliere,ezachte,dtaraborelli,otto,milimetric
}
and the role includes this group in the contacts:
nrpe::monitor_service { 'eventlogging':
    ensure        => 'present',
    description   => 'Check status of defined EventLogging jobs',
    nrpe_command  => '/usr/lib/nagios/plugins/check_eventlogging_jobs',
    require       => File['/usr/lib/nagios/plugins/check_eventlogging_jobs'],
    contact_group => 'admins,analytics',
}
If someone is missing from this list, or the check needs to be added to
another service, I'll be glad to do it.
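
For illustration, pointing another check at the same group would look
roughly like the sketch below. The resource title 'example_check', its
description, and the plugin path are hypothetical placeholders; only the
contact_group value is taken from the existing eventlogging definition.

```puppet
# Hypothetical sketch: 'example_check' and check_example are placeholders,
# not checks that exist today.
nrpe::monitor_service { 'example_check':
    ensure        => 'present',
    description   => 'Example check that also notifies analytics',
    nrpe_command  => '/usr/lib/nagios/plugins/check_example',
    # Same value as the eventlogging check, so the analytics group is notified.
    contact_group => 'admins,analytics',
}
```

Adding a person instead would just mean appending their username to the
members line of the analytics contactgroup.
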
Matanya
On 2014-03-20 13:50, Dan Andreescu wrote:
> Thank you for the detailed write-up, Ori.
>
>> We have to fix this. The level of maintenance that EventLogging gets is not
>> proportional to its usage across the organization. Analytics, I really need
>> you to step up your involvement.
>>
>> It was not long ago that EventLogging was running reliably for months at a
>> time. What has changed is not system load, but the owner seat becoming
>> vacant, leading to a gradual deterioration of the quality of monitoring and
>> auditing practices.
>
> Indeed, the owner seat is vacant. According to a recent discussion on the
> analytics list, we did not yet consider ourselves the proper owners of
> EventLogging. Our sprint planning is today and I'll bring it up and note its
> importance in light of this downtime.
>
>> Sean proposed moving the EventLogging database to m2, so that it runs on
>> separate hardware from the research databases. I think he's right. I filed
>> <https://rt.wikimedia.org/Ticket/Display.html?id=7081> [1] to request the
>> migration.
>
> Thank you, I support isolation.
>
>> Finally, I think EventLogging Icinga alerts should have a higher profile,
>> and possibly page someone. Issues can usually be debugged using the
>> eventloggingctl tool on Vanadium and by inspecting the log files on
>> vanadium:/var/log/upstart/eventlogging-*.
>
> I think this is the key reason the failure was ignored, so I agree here. We
> should at the very least forward these alerts as an email to analytics devs.
> I have no idea how to do that; if anyone would like to help, that'd be great.
>
> _______________________________________________
> Ops mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/ops [2]
Links:
------
[1] https://rt.wikimedia.org/Ticket/Display.html?id=7081
[2] https://lists.wikimedia.org/mailman/listinfo/ops
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics