I agree that it would be impossible to mask all business-confidential aggregate information, even if the data contain nothing personally identifiable. Obscuring the data to the degree needed to hide those aggregates would make the dataset essentially a synthetic one.
So it is reasonable to run this graph decomposition on a private cluster, but, as you say, that is not as useful for others to compare against. In conjunction with a not-quite-record-breaking public dataset that shows correct scaling, however, we would have something that others could use and that would let them understand the scaling to very large datasets.

For that matter, we can provide an algorithm that generates a synthetic dataset of suitable size (a minimal sketch of such a generator is at the end of this message). If the cost to decompose it matches the internal numbers, then that can be the benchmark. The great virtue of that approach is that we don't have to distribute the dataset, just the program that generates it. A synthetic example alone is of little interest, but in conjunction with measurements and close parallels to a real dataset, it has great interest.

On Thu, Feb 25, 2010 at 1:38 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> I will probably end up running this computation internally on our own hadoop
> cluster, but that's not as nice for these purposes for a public data set
> record.

--
Ted Dunning, CTO
DeepDyve
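
P.S. To make the "distribute the generator, not the data" idea concrete, here is a minimal sketch of the kind of program I have in mind: a seeded R-MAT-style edge generator. The class name, quadrant probabilities, and default scale/edge-count/seed are purely illustrative assumptions, not tuned to any real dataset. With a fixed seed, everyone who runs it gets the identical edge list, so the benchmark data never has to be hosted anywhere.

import java.util.Random;

/**
 * Illustrative sketch: deterministic R-MAT-style edge generator.
 * Same arguments and seed => same edge list on every machine, so only
 * this program needs to be distributed, never the dataset itself.
 */
public class RmatEdgeGenerator {
  // R-MAT quadrant probabilities (a + b + c + remainder == 1).
  // Skewed values give the power-law degree distribution typical of real graphs.
  private static final double A = 0.57, B = 0.19, C = 0.19;

  public static void main(String[] args) {
    int scale  = args.length > 0 ? Integer.parseInt(args[0]) : 20;   // 2^scale vertices
    long edges = args.length > 1 ? Long.parseLong(args[1]) : 16L << scale;
    long seed  = args.length > 2 ? Long.parseLong(args[2]) : 42L;

    int n = 1 << scale;
    Random rng = new Random(seed);   // fixed seed => reproducible dataset

    for (long e = 0; e < edges; e++) {
      int row = 0, col = 0, span = n;
      // Recursively pick a quadrant of the adjacency matrix until one cell remains.
      while (span > 1) {
        span >>= 1;
        double r = rng.nextDouble();
        if (r < A) {
          // top-left quadrant: nothing to add
        } else if (r < A + B) {
          col += span;               // top-right
        } else if (r < A + B + C) {
          row += span;               // bottom-left
        } else {
          row += span;               // bottom-right
          col += span;
        }
      }
      System.out.println(row + "\t" + col);   // tab-separated edge
    }
  }
}

For instance, "java RmatEdgeGenerator 26 1073741824 42 > edges.tsv" would emit roughly a billion edges over 2^26 vertices. This single-threaded sketch is only for fixing the dataset definition; for a dataset of record-challenging size we would presumably generate partitions of the edge list in parallel as a Hadoop job instead.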