>
> Is there a catalog of all data that could possibly be available (for
> instance, the mw.session cookie), along with where it is logged, for
> how long, and where in various toolchains it gets stripped out?
>
Not to my knowledge. So, in terms of /readers/, we deliberately have very
little. Possible vectors I'm aware of:

- the mw.session cookie. This is stripped out of varnishlog before it even
gets to the analytics machines, so presumably doesn't make it past udp2log.
- EventLogging data. For example, data to test how our caching or module
storage is working. We've got some of this for the time period I analysed,
and I'm planning on using the module storage data to test the algorithm,
since it contains a unique identifier independent of IP/UA. This sort of
information is gathered for specific tasks, though, rather than by default,
which I'm kind of happy with: if the existing algorithm is valid I don't
really want to see more PII in our logs. If not, eh, we'll assess how
important session data is outside of academia.
- the UA/IP/lang data
- ...that's it.

Obviously these are "vectors I'm aware of" - I am fully open to being
corrected by someone more informed than myself.
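
To make the EventLogging point concrete, here is a rough sketch of the kind of check a unique identifier independent of IP/UA would allow. It assumes the algorithm under test groups requests by an IP/UA/lang fingerprint, which is an assumption on my part for illustration; the field names ('ip', 'ua', 'lang', 'token') and the sample rows are hypothetical, not the real log schema.

```python
from collections import defaultdict


def fingerprint_agreement(events):
    """Compare IP/UA/lang fingerprints against an independent identifier.

    events: iterable of dicts with hypothetical keys
    'ip', 'ua', 'lang' and 'token' (the independent unique identifier).
    """
    tokens_per_fp = defaultdict(set)   # fingerprint -> identifiers seen
    fps_per_token = defaultdict(set)   # identifier -> fingerprints seen
    for e in events:
        fp = (e["ip"], e["ua"], e["lang"])
        tokens_per_fp[fp].add(e["token"])
        fps_per_token[e["token"]].add(fp)

    # A fingerprint carrying several identifiers merges distinct clients.
    merged = sum(1 for t in tokens_per_fp.values() if len(t) > 1)
    # An identifier spread over several fingerprints splits one client.
    split = sum(1 for f in fps_per_token.values() if len(f) > 1)
    return {
        "fingerprints": len(tokens_per_fp),
        "tokens": len(fps_per_token),
        "merged_fingerprints": merged,
        "split_tokens": split,
    }


# Made-up sample rows using documentation-range IP addresses.
sample = [
    {"ip": "192.0.2.1", "ua": "A", "lang": "en", "token": "t1"},
    {"ip": "192.0.2.1", "ua": "A", "lang": "en", "token": "t2"},
    {"ip": "192.0.2.2", "ua": "B", "lang": "de", "token": "t3"},
    {"ip": "192.0.2.3", "ua": "B", "lang": "de", "token": "t3"},
]
result = fingerprint_agreement(sample)
print(result)
```

If the merged and split counts are small relative to the totals, the fingerprint is probably a reasonable stand-in for the identifier; if not, that is the case where session data would need reassessing.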


> Related lists could be useful for planning:
> * Limitations our privacy policies place on data gathering (handy when
> reviewing those policies)
>
Indeed; the analytics team is working out how we address data retention as
we speak.

> * Studies that are easy and hard given the types of data we gather
> * Wishlists (from external researchers, and from internal staff) of
> data-sets that would be useful but aren't currently available.  Along
> with a sense of priority, complexity, cost.
>
Yep, these thought experiments are being factored into our data retention
discussion.
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
