> Are there any reasons to not replace HTTP GET request IP addresses and
> proxy information with their SHA-512 secure hash prior to writing them
> to permanent media?

To expand a bit on Dan's answer: for analytics we need raw IPs to do geolocation, which is an important piece of information, but beyond that we have not needed raw IPs for anything else so far. It is not unheard of for us to have to redo our pageview processing because of bugs in the code or issues within the pipeline, so we need to have raw data available for a certain buffer time.
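As a rough illustration of what the question proposes for the analytics side, here is a minimal sketch of a refine-style step that geolocates from the raw IP and then keeps only its SHA-512 hash. It is not our actual refine code: `GEO_TABLE`, `geolocate`, `hash_ip`, `refine`, and the record field names are all hypothetical, and the lookup table stands in for a real geolocation database.

```python
import hashlib

# Hypothetical stand-in for a real geolocation database lookup.
# The addresses are from the reserved documentation ranges.
GEO_TABLE = {"198.51.100.7": "NL", "203.0.113.9": "US"}


def geolocate(ip: str) -> str:
    """Return a country code for the IP, or "Unknown" if not found."""
    return GEO_TABLE.get(ip, "Unknown")


def hash_ip(ip: str) -> str:
    """Return the hex-encoded SHA-512 digest of the IP string."""
    return hashlib.sha512(ip.encode("utf-8")).hexdigest()


def refine(record: dict) -> dict:
    """Geolocate from the raw IP, then replace the IP with its hash."""
    ip = record["ip"]
    # Copy every field except the raw IP.
    refined = {k: v for k, v in record.items() if k != "ip"}
    refined["country"] = geolocate(ip)
    refined["ip_hash"] = hash_ip(ip)
    return refined


if __name__ == "__main__":
    raw = {"ip": "198.51.100.7", "uri_path": "/wiki/Main_Page"}
    print(refine(raw))
```

Note that once only the hash is written, geolocation cannot be redone from the stored rows, which is the tradeoff against keeping raw data around for a reprocessing buffer.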
Data needed for ops is a different matter: having raw IPs is useful for troubleshooting connection problems, DoS attacks, and similar issues. Normally, ops needs IPs to be available for some weeks, but not months, to troubleshoot issues with incoming traffic.

Data retention guidelines are documented here:
https://meta.wikimedia.org/wiki/Data_retention_guidelines

On Thu, Nov 10, 2016 at 7:00 AM, Dan Andreescu <[email protected]> wrote:

> So, just to give context, our HTTP requests take this path:
>
> * varnish log (very small buffer, not permanent)
> * varnishkafka
> * kafka (small buffer, I think 7 days)
> * camus
> * refine process (we use IPs at this point to geolocate)
> * webrequest table on hdfs (this is the first time they're stored on
>   permanent media, for 60 days)
> * other datasets like hourly pageviews aggregates, (IPs are not passed on
>   to these)
>
> So if we wanted to not store them in kafka buffers even, we'd have to give
> up geolocating. I think a lot of people find this very useful
> (fundraising, research, ops, reading), so it's unlikely to be removed.
>
> I don't have as clear a reason for why we store the plain IP in
> webrequest. I think we could count uniques and all that other stuff with
> the IP hash. It's a good question, tentative +1 unless I'm forgetting
> something. But even so, it's not so bad, it's only stored for 60 days and
> we have no other plain IPs anywhere else (like we removed them from Event
> Logging for example).
>
> On Tue, Nov 8, 2016 at 4:26 PM, James Salsman <[email protected]> wrote:
>
>> Are there any reasons to not replace HTTP GET request IP addresses and
>> proxy information with their SHA-512 secure hash prior to writing them
>> to permanent media?
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
