On Mon, Apr 26, 2010 at 10:49 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> Hi all,
>
> I'm looking for input on two questions about the raw data files from the
> Public Terabyte Dataset project:
>
> 1. Target file size. What's the biggest file size that people would want to
> handle?
>
> E.g. we could generate 1000 chunks of 1GB each, or 100 chunks of 10GB, etc.

I like chunks < 1 GB, if only because moving them over a network wastes less
effort when a transfer fails partway through.

> 2. Any value to specific grouping of data in files?
>
> E.g. we could try to ensure that all data from the same domain goes into
> the same file.
>
> But that might result in individual data files having more skew, and thus
> make it harder to get useful results from processing a subset of the data.

Exactly. I would find skewed data a pain in the butt for statistical analysis.
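
For concreteness, here's a rough sketch (in Python, purely illustrative; the
1 GB cap, output directory, file naming, and .warc extension are my own
placeholders, not anything from the project) of the kind of chunking I mean:
write records to size-capped files in arrival order, so each chunk stays
under ~1 GB and ends up as a roughly random sample rather than being grouped
by domain.

import os

CHUNK_CAP = 1_000_000_000  # assumed target of ~1 GB per chunk file

class ChunkWriter:
    """Rolls raw record bytes into sequentially numbered, size-capped files."""

    def __init__(self, out_dir):
        self.out_dir = out_dir
        self.index = 0
        self.bytes_written = 0
        self.fh = None

    def _open_next(self):
        # Close the current chunk (if any) and start a new one.
        if self.fh:
            self.fh.close()
        path = os.path.join(self.out_dir, f"chunk-{self.index:05d}.warc")
        self.fh = open(path, "wb")
        self.index += 1
        self.bytes_written = 0

    def write(self, record_bytes):
        # Roll over to a new chunk whenever this record would push the
        # current file past the cap.
        if self.fh is None or self.bytes_written + len(record_bytes) > CHUNK_CAP:
            self._open_next()
        self.fh.write(record_bytes)
        self.bytes_written += len(record_bytes)

    def close(self):
        if self.fh:
            self.fh.close()

# Hypothetical usage:
#   w = ChunkWriter("/data/ptd-chunks")
#   for rec in records:   # records = iterable of raw record bytes
#       w.write(rec)
#   w.close()

The point is just that chunk boundaries fall wherever the size cap says they
do, not along domain boundaries, which keeps any single chunk usable as an
unbiased subset.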