On Thu, Feb 25, 2010 at 12:42 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
> I'd be a little wary of that and I'd hate to see anything happen to it (AOL
> comes to mind).  That being said, if you just export the vectors w/o the
> key, it really is pretty anonymous.  What other sources can we get?
>

It's not really anonymous even in that case, because clustering run on
the data can extract unique identifiers pretty easily (if people guess
the central hubs, then look at the degree of connectivity between them
to clarify that further, etc.).  You'd be surprised (or maybe not) how
much information leaks through if only a simple anonymization step is
used.

On further thought, I think that no matter how much anonymization of the
user-base is done, a lot of non-public information about the graph as a
whole would be revealed at a glance, and it is not my place to let that
out (density of the graph, degree of connectivity, number of connections
per user as a histogram, etc.).

I will probably end up running this computation internally on our own
Hadoop cluster, but that's not as nice, for these purposes, as a record
on a public data set.

  -jake
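
To illustrate the point about degree leakage, here is a minimal sketch on a toy graph. Everything in it is an assumption made for illustration (the edge list, the node names, and the helpers `degrees`, `anonymize`, and `reidentify` are hypothetical and are not taken from the data set or from any tooling discussed in this thread). It replaces the user keys with random IDs, then shows that a node with a unique degree can still be matched and that the degree histogram of the whole graph survives unchanged.

```python
import random
from collections import Counter

def degrees(edges):
    """Count how many edges touch each node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def anonymize(edges, seed=0):
    """Replace every node name with a random opaque integer ID."""
    nodes = sorted({n for e in edges for n in e})
    ids = list(range(len(nodes)))
    random.Random(seed).shuffle(ids)
    mapping = dict(zip(nodes, ids))
    return [(mapping[a], mapping[b]) for a, b in edges], mapping

def reidentify(anon_edges, known_degrees):
    """Match names to anonymized IDs when the guessed degree is unique."""
    anon_deg = degrees(anon_edges)
    by_degree = Counter(anon_deg.values())
    guesses = {}
    for name, d in known_degrees.items():
        if by_degree[d] == 1:
            guesses[name] = next(n for n, k in anon_deg.items() if k == d)
    return guesses

# Hypothetical toy graph: "hub" is connected to everyone else.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
anon_edges, secret_mapping = anonymize(edges)

# The attacker only guesses that one account has four connections.
guesses = reidentify(anon_edges, {"hub": 4})

# Graph-wide statistics are untouched by the relabeling.
degree_histogram = Counter(degrees(anon_edges).values())
```

Here `guesses["hub"]` ends up equal to `secret_mapping["hub"]`, and `degree_histogram` is the same as for the original graph. Relabeling nodes hides the names but none of the structure.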