On Thu, Mar 20, 2014 at 03:52:01AM -0700, Ori Livneh wrote:
> The Upstart job for EventLogging is configured to re-spawn the writer, up
> to a certain threshold of failures. Because the writer repeatedly failed to
> connect, it hit the threshold, and was not re-spawned.

This sounds like a bug. A temporary issue (database unavailability, for
whatever reason) should not result in a permanent crash of the service
that needs manual action to restore. This needs to be fixed.

> This alert was not responded to. I finally got pinged by Tillman, who
> noticed the blog visitor stats report was blank, and by Gilles, who noticed
> image loading performance data was missing.
>
> We have to fix this. The level of maintenance that EventLogging gets is not
> proportional to its usage across the organization. Analytics, I really need
> you to step up your involvement.

I can't comment on the general involvement of analytics in this area, but
I do think that responding to Icinga alerts is primarily a techops
responsibility. We can and should escalate as necessary, and it's
obviously always nice & appreciated to see non-ops people lurking around
in #wikimedia-operations and jumping in on failures, but I don't think I'd
blame anyone else for not reacting to an alert. Especially in this case,
as a trivial investigation would have led anyone to the conclusion that a
simple restart of the Upstart job would fix this (AIUI).

> Finally, I think EventLogging Icinga alerts should have a higher profile,
> and possibly page someone. Issues can usually be debugged using the
> eventloggingctl tool on Vanadium and by inspecting the log files on
> vanadium:/var/log/upstart/eventlogging-*.

We generally try to keep paging to a minimum. First, for our personal
sanity :), but more importantly because if your phone keeps beeping all
day, you become accustomed to it and it becomes easier to ignore a "site
is down" alert. IMO, pages are for very serious alerts.

That doesn't mean that the other (CRITICAL but non-paging) alerts are
meant to be ignored for days. In my experience, very few opsens actively
monitor the Icinga unhandled services page (let alone fix random issues or
even their own issues as they see them) and I think we can do better than
that.
I personally check that page several times a day, as well as the IRC log,
but I do wonder what others do or how they feel about this, especially as
we've agreed to scale up the number of checks (and hence alerts) that we
have.

Faidon
