> Is there a catalog of all data that could possibly be available (for
> instance, the mw.session cookie), along with where it is logged, for
> how long, and where in various toolchains it gets stripped out?

Not to my knowledge. So, in terms of /readers/, we deliberately have very little. Possible vectors I'm aware of:
- the mw.session cookie. This is stripped out of varnishlog before it even gets to the analytics machines, so presumably doesn't make it past udp2log.
- EventLogging data. For example, data to test how our caching or module storage is working. We've got some of this for the time period I analysed, and I'm planning on using the module storage data to test the algorithm, since it contains a unique identifier independent of IP/UA. This sort of information is gathered for specific tasks, though, rather than by default, which I'm kind of happy with: if the existing algorithm is valid I don't really want to see more PII in our logs. If not, eh, we'll assess how important session data is outside of academia.
- the UA/IP/lang data
- ...that's it.

Obviously these are "vectors I'm aware of" - I am fully open to being corrected by someone more informed than myself.

> Related lists could be useful for planning:
> * Limitations our privacy policies place on data gathering (handy when
> reviewing those policies)

Indeed; the analytics team is working out how we address data retention as we speak.

> * Studies that are easy and hard given the types of data we gather
> * Wishlists (from external researchers, and from internal staff) of
> data-sets that would be useful but aren't currently available. Along
> with a sense of priority, complexity, cost.

Yep, these thought experiments are being factored into our data retention discussion.
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
