I agree that it would be impossible to mask all of the business-confidential
aggregate information, even if the data contains nothing personally
identifiable.  Obscuring the data to the degree needed to hide those
aggregates would make the dataset essentially a synthetic one.

So running this graph decomposition on a private cluster is a reasonable
thing to do, but, as you say, it is not as useful for others to compare
against.

But in conjunction with a not-quite-record-breaking public dataset that shows
correct scaling, we would have something others could use, and they would
be able to understand how the computation scales to very large datasets.

For that matter, we could provide an algorithm that generates a synthetic
dataset of suitable size.  If the cost to decompose it matches the internal
numbers, then that can be the benchmark.  The great virtue of that approach
is that we don't have to distribute the dataset, just the program that
generates it.  A synthetic example alone is of little interest, but in
conjunction with measurements and close parallels to a real dataset, it
becomes very interesting.
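Just to make the idea concrete, here is a rough, hypothetical sketch of the
kind of generator program I have in mind.  The sizes, the power-law
exponents, and the tab-separated "row col value" output format are all
placeholders for illustration, not anything tied to the real data; the only
important properties are that it is seeded (so anyone can reproduce exactly
the same dataset) and that the nonzero structure is skewed the way real
document-term or adjacency data tends to be:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

/**
 * Hypothetical sketch of a seeded synthetic sparse-matrix generator.
 * Row lengths follow a rough power law so the nonzero structure loosely
 * resembles a real document-term or adjacency matrix.  Output is one
 * "row<TAB>col<TAB>value" triple per line.
 */
public class SyntheticDatasetGenerator {
  public static void main(String[] args) throws IOException {
    long rows = 1_000_000L;        // target number of rows (placeholder)
    int cols = 200_000;            // target number of columns (placeholder)
    int maxNonZerosPerRow = 1000;  // cap on row length (placeholder)
    Random rng = new Random(42);   // fixed seed: everyone generates identical data

    try (BufferedWriter out = new BufferedWriter(new FileWriter("synthetic.tsv"))) {
      for (long row = 0; row < rows; row++) {
        // Power-law distributed row length: most rows short, a few very long.
        int nnz = (int) Math.min(maxNonZerosPerRow,
            Math.floor(Math.pow(rng.nextDouble(), -1.5)));
        for (int k = 0; k < nnz; k++) {
          // Skewed column choice so some columns are far more popular than others.
          int col = (int) Math.floor(cols * Math.pow(rng.nextDouble(), 2));
          double value = 1 + rng.nextInt(5);   // small integer "counts"
          out.write(row + "\t" + col + "\t" + value + "\n");
        }
      }
    }
  }
}

The point of the sketch is only that the whole benchmark fits in a file this
size: you publish the program and the seed, and anyone can regenerate the
input and time the decomposition themselves.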

On Thu, Feb 25, 2010 at 1:38 PM, Jake Mannix <jake.man...@gmail.com> wrote:

>
> I will probably end up running this computation internally on our own
> hadoop
> cluster, but that's not as nice for these purposes for a public data set
> record.




-- 
Ted Dunning, CTO
DeepDyve
