> Not to hijack the thread, but: to do this in the schema itself confuses the
> structure of the data with the mechanics of its use. I think having a couple
> of helpers in JavaScript and PHP for simple random sampling is sufficient.

I very much agree with Ori here. We would be bloating the schema with properties that have nothing to do with data definition.
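To make the "couple of helpers" idea concrete, here is a minimal sketch of simple random sampling kept outside the schema. It is written in Python purely for illustration (the helpers discussed above would be in JavaScript and PHP), and the names in_sample, log_event_sampled and the log_event callback are hypothetical, not part of EventLogging.

```python
import random


def in_sample(rate, rng=random.random):
    """Return True for roughly a `rate` fraction of calls (simple random sampling).

    `rate` is a probability between 0 and 1. `rng` is injectable so the
    decision can be made deterministic in tests.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return rng() < rate


def log_event_sampled(log_event, schema, event, rate, rng=random.random):
    """Call the hypothetical `log_event(schema, event)` only for sampled events.

    Returns True if the event was logged, False if it was skipped.
    """
    if in_sample(rate, rng):
        log_event(schema, event)
        return True
    return False
```

The sampling rate lives with the calling code, so changing it never requires a new schema revision.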
> Note that – per our data retention guidelines [1] – not all EL data is
> expected to be automatically purged within 90 days (see the section on
> “Non-personal information associated with a user account”)

I certainly think we should keep performance data (like navigation timing) for longer than 90 days, removing pageId and userId if needed.

On Wed, May 21, 2014 at 9:03 AM, Ori Livneh <[email protected]> wrote:
>
> On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli
> <[email protected]> wrote:
>>
>> On May 20, 2014, at 10:09 PM, Sean Pringle <[email protected]> wrote:
>>
>> Hi!
>>
>> I'd like to hear from stakeholders about purging old data from the
>> eventlogging database. Yes, no, why [not], etc.
>>
>> I understand from Ori that there is a 90 day retention policy, and that
>> purging has been discussed previously but not addressed for various reasons.
>> Certainly there are many timestamps older than 90 days still in the db, and
>> apparently largely untouched by queries?
>>
>> Perhaps we're in a better position now to do this properly what with data
>> now in multiple places: log files, database, hadoop...
>>
>> Can we please purge stuff? :-)
>>
>> BR
>> Sean
>>
>>
>> Hi Sean,
>>
>> I sent a similar proposal to the internal list for preliminary feedback
>> (see item 2 below)
>>
>> All, I wanted to hear your thoughts informally (before posting to the
>> lists) on two ideas that have been floating around recently:
>>
>> 1) add support for optional sampling in EventLogging via JSON schemas
>> (given the sheer number of teams who have asked for it). See
>> https://bugzilla.wikimedia.org/show_bug.cgi?id=65500
>
> Not to hijack the thread, but: to do this in the schema itself confuses the
> structure of the data with the mechanics of its use. I think having a couple
> of helpers in JavaScript and PHP for simple random sampling is sufficient.
>>
>>
>> 2) introduce 90-day pruning by default for all logs, (adding a dedicated
>> schema element to override the default).
>
> Same problem. To illustrate: suppose we're two months into a data collection
> job. The researcher carelessly forgot to modify the pruning policy, so it's
> set to the default 90 days, whereas the researcher needs it for 180. At this
> point our options are:
>
> 1) Decline to help, even though there's a full month before the pruning
> kicks in.
> 2) Somehow alter the schema revision without creating a new revision.
> EventLogging assumes that schema revisions are immutable and it exploits
> this property to provide guarantees about data validity and consistency, so
> this is a nonstarter.
> 3) Create a new schema revision that declares a 180 day expiration and then
> populate its table with a copy of each event logged under the previous
> schema.
>
> The motivation behind your proposal is (I think) a desire to have a unified
> configuration interface for data collection jobs. This makes total sense and
> it's worth pursuing. I just don't think we should stuff everything into the
> schema. The schema is just that: a schema. It's a data model.
>
>>
>> This would push to the customers the responsibility of ensuring the right
>> data is collected and retained.
>>
>> I understand 2) has already been partly implemented for the raw JSON logs
>> (not yet for EL data stored in SQL). Obviously, we would need to audit
>> existing logs to make sure that we don’t discard data that needs to be
>> retained in a sanitized or aggregate form past 90 days.
>>
>>
>> Note that – per our data retention guidelines [1] – not all EL data is
>> expected to be automatically purged within 90 days (see the section on
>> “Non-personal information associated with a user account”): many of these
>> logs have a status similar to MediaWiki data that is retained in the DB but
>> not fully exposed to labs.
>
>>
>> For this reason, I am proposing that we enable 90-day pruning by default
>> for new schemas, with the ability to override the default.
>
>
> Sounds good to me. I figure that the overrides would be specified as
> configuration values for the script that does the actual pruning. We could
> Puppetize that and document the process for adding exemptions.
>
>>
>> Existing schemas would need to be audited on a case by case basis.
>
>
> By whom? :) Surely not Sean! It'd be great to get this process going.
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
