On Thu, Feb 25, 2010 at 12:42 PM, Grant Ingersoll <gsing...@apache.org> wrote:

>
> I'd be a little wary of that and I'd hate to see anything happen to it (AOL
> comes to mind).  That being said, if you just export the vectors w/o the
> key, it really is pretty anonymous.    What other sources can we get?
>

It's not really anonymous even in that case, because clustering the data can
extract unique identifiers pretty easily (if people guess the central hubs,
then look at the degree of connectivity between them to clarify that further,
etc...).  You'd be surprised (or maybe not) how much information leaks
through if only a simple anonymization step is used.
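
To make the leak concrete, here is a minimal sketch on a toy graph. It
assumes the "simple anonymization step" is just replacing each user id with
a random integer; the node names and helper functions are hypothetical and
only for illustration. Relabeling leaves every node's degree unchanged, so
anyone who can guess which user is the biggest hub can find it again in the
anonymized data.

```python
import random
from collections import defaultdict

def degrees(edges):
    """Return {node: degree} for an undirected edge list."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

def anonymize(edges, seed=0):
    """Assumed 'simple' step: swap each node id for a random integer label."""
    nodes = sorted({n for e in edges for n in e})
    labels = list(range(len(nodes)))
    random.Random(seed).shuffle(labels)
    mapping = dict(zip(nodes, labels))
    return [(mapping[u], mapping[v]) for u, v in edges], mapping

def reidentify_hub(anon_edges):
    """Guess the hub: the anonymized node with the highest degree."""
    deg = degrees(anon_edges)
    return max(deg, key=deg.get)

# Toy graph: "hub" is linked to everyone, the other nodes have few links.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
anon_edges, mapping = anonymize(edges)
guess = reidentify_hub(anon_edges)

# The attacker never sees `mapping`; it is used here only to check the guess.
print(guess == mapping["hub"])
```

Once a few hubs are pinned down this way, the connections between them and
their neighbours narrow down further identities, which is the effect
described above.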

On further thought, I think that no matter how much anonymization of the
user-base is done, a lot of non-public information about the graph as a
whole would be revealed at a glance (density of the graph, degree of
connectivity, number of connections per user as a histogram, etc.), and it
is not my place to release that.
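
As a rough illustration of how little work "at a glance" takes, the sketch
below computes those aggregates from nothing but an anonymized edge list.
It assumes a simple undirected graph with no duplicate edges, and the
function name and sample edges are made up for the example.

```python
from collections import Counter

def graph_summary(edges):
    """Aggregate statistics recoverable from an anonymized edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n = len(deg)    # nodes that appear in at least one edge
    m = len(edges)  # assumes each undirected edge is listed once
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return {
        "nodes": n,
        "edges": m,
        "density": density,
        # number of users having each connection count
        "degree_histogram": dict(Counter(deg.values())),
    }

# Made-up anonymized edges: integer labels carry no user identity.
anon_edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
summary = graph_summary(anon_edges)
print(summary)
```

None of this depends on knowing who any user is, so stripping the keys does
nothing to hide it.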

I will probably end up running this computation internally on our own Hadoop
cluster, but for these purposes that's not as nice as having a public data
set on record.

  -jake
