On Thu, Mar 20, 2014 at 3:49 PM, Toby Negrin <[email protected]> wrote:

> We will work with Ori to understand what level of effort is required to
> support EventLogging. It's likely that Analytics and techops (and Ori) will
> need to collaborate on what will need to be done.
>

* The Ganglia scripts need to be fixed.
* A daily report should go out with the number of valid and invalid
events logged, broken down by schema.
* Someone needs to scan that report for anything unusual, file bugs for code
that violates its data model, and follow up with the relevant team to
ensure a fix.
* Alerts need to be responded to.
* Once a month, the backup process (vanadium -> stat1001 -> tridge) should
get a quick look-over to ensure that it is functioning.
* Once every six months, a drill should be conducted to test system
failover and recovery procedures.
* There should be a designated person to provide technical advice and
Gerrit code review for new instrumentation code. (This has already scaled
beyond just me -- folks like Matt F, Yuvi, Jon, Bryan, etc. have the
requisite expertise. But someone needs to own this, and be accountable that
code review happens in a prompt fashion.)
* Bugs reported in Bugzilla should be acknowledged and resolved.
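
For the daily report item above, here is a minimal sketch of the per-schema
tally. It assumes the validation results are already available as
(schema name, is_valid) pairs. That input format, the function names
tally_by_schema and format_report, and the column layout are all hypothetical
and are not taken from the existing EventLogging code.

```python
from collections import Counter


def tally_by_schema(events):
    """Count valid and invalid events per schema.

    `events` is an iterable of (schema_name, is_valid) pairs. This input
    shape is an assumption made for illustration only.
    """
    valid, invalid = Counter(), Counter()
    for schema, is_valid in events:
        if is_valid:
            valid[schema] += 1
        else:
            invalid[schema] += 1
    # Include schemas that only ever produced invalid events.
    schemas = sorted(set(valid) | set(invalid))
    return {schema: (valid[schema], invalid[schema]) for schema in schemas}


def format_report(tally):
    """Render the tally as a plain-text table suitable for an email body."""
    row = "{:<24}{:>8}{:>8}"
    lines = [row.format("schema", "valid", "invalid")]
    for schema, (ok, bad) in tally.items():
        lines.append(row.format(schema, ok, bad))
    return "\n".join(lines)
```

Whoever scans the report would mostly look at the invalid column: a schema
whose invalid count jumps from one day to the next is a likely sign of code
that violates its data model.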

Toby, I think you guys have the requisite talent and capacity to handle it
internally.
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
