On May 20, 2014, at 10:09 PM, Sean Pringle <[email protected]> wrote:

> Hi!
> 
> I'd like to hear from stakeholders about purging old data from the 
> eventlogging database. Yes, no, why [not], etc.
> 
> I understand from Ori that there is a 90 day retention policy, and that 
> purging has been discussed previously but not addressed for various reasons. 
> Certainly there are many timestamps older than 90 days still in the db, and 
> apparently largely untouched by queries?
> 
> Perhaps we're in a better position now to do this properly what with data now 
> in multiple places: log files, database, hadoop...
> 
> Can we please purge stuff? :-)
> 
> BR
> Sean

Hi Sean, 

I sent a similar proposal to the internal list for preliminary feedback (see 
item 2 below):

> All, I wanted to hear your thoughts informally (before posting to the lists) 
> on two ideas that have been floating around recently:
> 
> 1) add support for optional sampling in EventLogging via JSON schemas (given 
> the sheer number of teams who have asked for it). See 
> https://bugzilla.wikimedia.org/show_bug.cgi?id=65500
> 
> 2) introduce 90-day pruning by default for all logs (adding a dedicated 
> schema element to override the default).
> 
> This would push to the customers the responsibility of ensuring the right 
> data is collected and retained.
> 
> I understand 2) has already been partly implemented for the raw JSON logs 
> (not yet for EL data stored in SQL). Obviously, we would need to audit 
> existing logs to make sure that we don’t discard data that needs to be 
> retained in a sanitized or aggregate form past 90 days.
> 

Note that – per our data retention guidelines [1] – not all EL data is expected 
to be automatically purged within 90 days (see the section on “Non-personal 
information associated with a user account”): many of these logs have a status 
similar to MediaWiki data that is retained in the DB but not fully exposed to 
labs. For this reason, I am proposing that we enable 90-day pruning by default 
for new schemas, with the ability to override the default. Existing schemas 
would need to be audited on a case-by-case basis.
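
To make the default-with-override idea concrete, here is a rough sketch in 
Python. It is only an illustration: the `retention_days` schema element and 
both helper functions are hypothetical names, not part of EventLogging today, 
and the real purge job would presumably run against the SQL tables instead.

```python
from datetime import datetime, timedelta, timezone

# Default proposed in this thread; everything else below is an assumption.
DEFAULT_RETENTION_DAYS = 90


def retention_days(schema):
    """Return the retention period in days, or None to retain indefinitely.

    "retention_days" is a hypothetical schema element used for illustration.
    A schema that omits it falls back to the 90-day default.
    """
    return schema.get("retention_days", DEFAULT_RETENTION_DAYS)


def is_expired(event_time, schema, now):
    """Tell whether an event logged at event_time is due for purging."""
    days = retention_days(schema)
    if days is None:
        # Explicit override: this schema is exempt from automatic pruning.
        return False
    return now - event_time > timedelta(days=days)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    old_event = now - timedelta(days=120)
    print(is_expired(old_event, {}, now))                        # default applies
    print(is_expired(old_event, {"retention_days": None}, now))  # override applies
```

A new schema that says nothing gets the 90-day default; a schema whose data 
falls under the longer-lived categories in [1] would set the override 
explicitly, which keeps the retention decision visible in the schema itself.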

Dario


[1] https://meta.wikimedia.org/wiki/Data_retention_guidelines

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
