So, just to give context, our HTTP requests take this path:

* Varnish log (very small buffer, not permanent)
* varnishkafka
* Kafka (small buffer, I think 7 days)
* Camus
* refine process (we use IPs at this point to geolocate)
* webrequest table on HDFS (this is the first time they're stored on
  permanent media, for 60 days)
* other datasets, like hourly pageview aggregates (IPs are not passed on
  to these)

So if we wanted to keep IPs out of even the Kafka buffers, we'd have to
give up geolocating.  I think a lot of people find geolocation very useful
(fundraising, research, ops, reading), so it's unlikely to be removed.

I don't have as clear a reason for why we store the plain IP in
webrequest.  I think we could count uniques and do the other things we
need with the IP hash.  It's a good question, tentative +1 unless I'm
forgetting something.  But even so, it's not so bad: it's only stored for
60 days, and we have no plain IPs anywhere else (for example, we removed
them from EventLogging).
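
To make the idea concrete, here is a rough sketch of a refine step that
geolocates from the plain IP and then keeps only its SHA-512 hash, so that
uniques can still be counted afterwards.  This is not our actual refine
code: the record fields ("ip", "uri", "country", "ip_hash"), the refine()
helper, and the toy geolocation table are all made up for illustration,
and the addresses are documentation-range examples.

```python
import hashlib


def hash_ip(ip: str) -> str:
    """Return the hex SHA-512 digest of an IP address string."""
    return hashlib.sha512(ip.encode("ascii")).hexdigest()


def refine(record: dict, geolocate) -> dict:
    """Geolocate from the plain IP, then keep only the hash of the IP.

    `record` and `geolocate` are hypothetical stand-ins for a webrequest
    row and a geolocation lookup; they are not the real pipeline's API.
    """
    refined = {key: value for key, value in record.items() if key != "ip"}
    refined["country"] = geolocate(record["ip"])  # needs the plain IP
    refined["ip_hash"] = hash_ip(record["ip"])    # plain IP is dropped after this
    return refined


# Toy lookup table standing in for a real geolocation database.
TOY_GEO = {"192.0.2.1": "AA", "192.0.2.2": "BB"}

raw_requests = [
    {"ip": "192.0.2.1", "uri": "/wiki/A"},
    {"ip": "192.0.2.2", "uri": "/wiki/B"},
    {"ip": "192.0.2.1", "uri": "/wiki/C"},
]

refined_rows = [
    refine(row, lambda ip: TOY_GEO.get(ip, "unknown")) for row in raw_requests
]

# Equal IPs give equal hashes, so distinct hashes count distinct IPs.
unique_clients = len({row["ip_hash"] for row in refined_rows})
```

The point of the ordering is that geolocation has to happen while the plain
IP is still available; once only the hash is kept, anything that just needs
equality (like counting uniques) keeps working.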

On Tue, Nov 8, 2016 at 4:26 PM, James Salsman <[email protected]> wrote:

> Are there any reasons to not replace HTTP GET request IP addresses and
> proxy information with their SHA-512 secure hash prior to writing them
> to permanent media?
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
