On Mon, Apr 26, 2010 at 10:49 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Hi all,
>
> I'm looking for input on two questions about the raw data files from the
> Public Terabyte Dataset project:
>
> 1. Target file size. What's the biggest file size that people would want to
> handle?
>
> E.g. we could generate 1000 chunks of 1GB each, or 100 chunks of 10GB, etc.

I like chunks < 1 GB, if only because moving them over a network wastes less
effort when a transfer fails partway through.

> 2. Any value to specific grouping of data in files?
>
> E.g. we could try to ensure that all data from the same domain goes into
> the same file.
>
> But that might result in individual data files having more skew, and thus
> make it harder to get useful results from processing a subset of the data.

Exactly. I would find skewed data a pain in the butt for statistical analysis.
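
For concreteness, here's a rough sketch (in Python, purely illustrative; the
1 GB cap, output directory, file naming, and .warc extension are my own
placeholders, not anything from the project) of the kind of chunking I mean:
write records to size-capped files in arrival order, so each chunk stays
under ~1 GB and ends up as a roughly random sample rather than being grouped
by domain.

import os

CHUNK_CAP = 1_000_000_000  # assumed target of ~1 GB per chunk file

class ChunkWriter:
    """Rolls raw record bytes into sequentially numbered, size-capped files."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self.index = 0
        self.bytes_written = 0
        self.fh = None

    def _open_next(self):
        # Close the current chunk (if any) and start a new one.
        if self.fh:
            self.fh.close()
        path = os.path.join(self.out_dir, f"chunk-{self.index:05d}.warc")
        self.fh = open(path, "wb")
        self.index += 1
        self.bytes_written = 0

    def write(self, record_bytes):
        # Roll over to a new chunk whenever this record would push the
        # current file past the cap.
        if self.fh is None or self.bytes_written + len(record_bytes) > CHUNK_CAP:
            self._open_next()
        self.fh.write(record_bytes)
        self.bytes_written += len(record_bytes)

    def close(self):
        if self.fh:
            self.fh.close()

# Hypothetical usage:
#   w = ChunkWriter("/data/ptd-chunks")
#   for rec in records:   # records = iterable of raw record bytes
#       w.write(rec)
#   w.close()

The point is just that chunk boundaries fall wherever the size cap says they
do, not along domain boundaries, which keeps any single chunk usable as an
unbiased subset.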