On May 20, 2014, at 10:09 PM, Sean Pringle <[email protected]> wrote:
> Hi!
>
> I'd like to hear from stakeholders about purging old data from the
> eventlogging database. Yes, no, why [not], etc.
>
> I understand from Ori that there is a 90 day retention policy, and that
> purging has been discussed previously but not addressed for various reasons.
> Certainly there are many timestamps older than 90 days still in the db, and
> apparently largely untouched by queries?
>
> Perhaps we're in a better position now to do this properly what with data now
> in multiple places: log files, database, hadoop...
>
> Can we please purge stuff? :-)
>
> BR
> Sean

Hi Sean,

I sent a similar proposal to the internal list for preliminary feedback (see item 2 below)

> All, I wanted to hear your thoughts informally (before posting to the lists)
> on two ideas that have been floating around recently:
>
> 1) add support for optional sampling in EventLogging via JSON schemas (given
> the sheer number of teams who have asked for it). See
> https://bugzilla.wikimedia.org/show_bug.cgi?id=65500
>
> 2) introduce 90-day pruning by default for all logs, (adding a dedicated
> schema element to override the default).
>
> This would push to the customers the responsibility of ensuring the right
> data is collected and retained.
>
> I understand 2) has already been partly implemented for the raw JSON logs
> (not yet for EL data stored in SQL). Obviously, we would need to audit
> existing logs to make sure that we don’t discard data that needs to be
> retained in a sanitized or aggregate form past 90 days.

Note that – per our data retention guidelines [1] – not all EL data is expected to be automatically purged within 90 days (see the section on “Non-personal information associated with a user account”): many of these logs have a status similar to MediaWiki data that is retained in the DB but not fully exposed to labs.

For this reason, I am proposing that we enable 90-day pruning by default for new schemas, with the ability to override the default. Existing schemas would need to be audited on a case by case basis.

Dario

[1] https://meta.wikimedia.org/wiki/Data_retention_guidelines
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
