> Agreed, on both points, but there is a big difference between
> "logically we can reason this is the case" and "we have proven that
> this is the case, and it impacts different groups in these proportions, and etc etc etc".
I see. Very true.

> which means even things we all know to be true (like the mobile point) need validation.

While I do not see anything wrong with documenting and quantifying it, it is worth keeping in mind that mobile is a different case. Sharing an IP across many users is common there because of mobile protocols' use of NAT:
https://en.wikipedia.org/wiki/Network_address_translation

Take a look at:
http://stackoverflow.com/questions/10946624/finding-ip-address-for-iphone

On Tue, Jan 5, 2016 at 10:31 AM, Oliver Keyes <[email protected]> wrote:

> On 5 January 2016 at 13:01, Nuria Ruiz <[email protected]> wrote:
> >> So, the goal is to have a UUID _distinct_ from IP and user agent (that is,
> >> the IP and UA are not related to the UUID that's generated) so
> >> that that UUID can be used as a baseline for accuracy purposes.
> >
> > I understand. But let me re-explain: my point was mentioning that regarding
> > #2 (decay) we already know that the IP + UA combo in many instances decays
> > real slowly, so the long tail is very significant and that we really do not
> > need a token to prove this fact.
> >
> > I just wanted to mention research that has already been done so you have it
> > also as a reference and we do not duplicate work.
> >
> >> so much as "does a user_agent/ip hash make a good UUID, generally".
> > Depends on what "generally" means, in mobile the answer is most definitely
> > no. Again, you do not need a token to prove this fact, as mobile providers
> > use sometimes a short IP range for tens of thousands of customers.
> >
>
> Agreed, on both points, but there is a big difference between
> "logically we can reason this is the case" and "we have proven that
> this is the case, and it impacts different groups in these
> proportions, and etc etc etc". The goal is not just to provide a
> reference point for internal use but also to write it up for
> publication so it can be used more generally, which means even things
> we all know to be true (like the mobile point) need validation.
>
> > On Sun, Jan 3, 2016 at 9:48 AM, Oliver Keyes <[email protected]> wrote:
> >>
> >> Hey Nuria,
> >>
> >> So, the goal is to have a UUID _distinct_ from IP and user agent (that
> >> is, the IP and UA are not related to the UUID that's generated) so
> >> that that UUID can be used as a baseline for accuracy purposes. Think
> >> the UUID in the ModuleStorage test datasets from wayback. So it's not
> >> "can any individual user be de-aggregated" so much as "does a
> >> user_agent/ip hash make a good UUID, generally". If I'm understanding
> >> that page correctly, it's more aimed at the former problem.
> >>
> >> On 3 January 2016 at 11:29, Nuria <[email protected]> wrote:
> >> > Oliver,
> >> >
> >> > You might want to check our documentation in wikitech regarding identity
> >> > reconstruction. I think it covers your point #1.
> >> >
> >> > https://wikitech.wikimedia.org/wiki/Analytics/Data/Preventing_identity_reconstruction
> >> >
> >> > Nuria
> >> >
> >> > On Jan 2, 2016, at 10:00 AM, Oliver Keyes <[email protected]> wrote:
> >> >
> >> > Hey y'all
> >> >
> >> > I'm working on a piece of research (largely recreational) on the old
> >> > problem of fingerprinting users with minimal information - namely the
> >> > combination of a user agent and an IP address. Basically I'm looking
> >> > to put together a piece of work showing:
> >> >
> >> > 1. How sub-standard it is;
> >> > 2. How fast it decays;
> >> > 3. How the sub-standardness varies by (platform|location)
> >> >
> >> > This would be pretty doable with internal data; basically I'd need a
> >> > schema with IP, user agent and a per-user UUID that's got a decent
> >> > (>=24 hours) expiry time. My question: does anyone know of a table
> >> > with recent data that meets these requirements? And, if not, anyone
> >> > with EventLogging experience interested in working on the problem with
> >> > me?
> >> >
> >> > --
> >> > Oliver Keyes
> >> > Count Logula
> >> > Wikimedia Foundation
> >>
> >> --
> >> Oliver Keyes
> >> Count Logula
> >> Wikimedia Foundation
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
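
For reference, the question quoted above ("does a user_agent/ip hash make a good UUID, generally") could be quantified along these lines. This is only a rough sketch, not anything that exists in our codebase: the function names, the tuple layout (uuid, ip, user_agent) and the sample rows are all made up for illustration, and SHA-256 is just one possible choice of hash.

```python
import hashlib
from collections import defaultdict


def fingerprint(ip: str, user_agent: str) -> str:
    """Hash an IP + user agent pair into one identifier (hypothetical helper)."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()


def fingerprint_quality(records):
    """Compare IP+UA fingerprints against a per-user UUID baseline.

    records: iterable of (uuid, ip, user_agent) tuples (assumed layout).
    Returns (shared, split):
      shared = fraction of fingerprints seen with more than one UUID
      split  = fraction of UUIDs seen with more than one fingerprint
    """
    uuids_by_fp = defaultdict(set)
    fps_by_uuid = defaultdict(set)
    for uuid, ip, user_agent in records:
        fp = fingerprint(ip, user_agent)
        uuids_by_fp[fp].add(uuid)
        fps_by_uuid[uuid].add(fp)
    if not uuids_by_fp:
        return 0.0, 0.0
    shared = sum(len(u) > 1 for u in uuids_by_fp.values()) / len(uuids_by_fp)
    split = sum(len(f) > 1 for f in fps_by_uuid.values()) / len(fps_by_uuid)
    return shared, split


# Made-up rows: two mobile users behind the same NAT address,
# and one desktop user whose IP changes between requests.
sample = [
    ("u1", "203.0.113.7", "MobileUA/1.0"),
    ("u2", "203.0.113.7", "MobileUA/1.0"),
    ("u3", "198.51.100.4", "DesktopUA/2.0"),
    ("u3", "198.51.100.9", "DesktopUA/2.0"),
]
shared, split = fingerprint_quality(sample)
```

Grouping the same two ratios by platform or location would speak to point 3 in the original list, and recomputing them over growing time windows would give a rough view of decay (point 2).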
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
