Hi all,
I'm looking for input on two questions about the raw data files from
the Public Terabyte Dataset project:
1. Target file size. What's the largest file size people would want to
handle?
E.g. we could generate 1000 chunks of 1GB each, or 100 chunks of 10GB,
etc. (see the chunk-count sketch below).
2. Is there any value in a specific grouping of the data within files?
E.g. we could try to ensure that all data from the same domain goes
into the same file (see the grouping sketch below).
But that could give individual data files more skew, which would make
it harder to get useful results from processing only a subset of the
data.
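
To make the size tradeoff in question 1 concrete, here's a minimal Java
sketch of the chunk-count math, assuming the raw data totals roughly
1 TB (the figure implied by the 1000 x 1GB / 100 x 10GB examples above):

public class ChunkSizing {
    static final long TOTAL_BYTES = 1_000_000_000_000L; // ~1 TB, assumed

    // How many output files we'd generate for a given target file size.
    static long numChunks(long targetFileBytes) {
        return (TOTAL_BYTES + targetFileBytes - 1) / targetFileBytes; // round up
    }

    public static void main(String[] args) {
        System.out.println(numChunks(1_000_000_000L));  // 1 GB files  -> 1000 chunks
        System.out.println(numChunks(10_000_000_000L)); // 10 GB files -> 100 chunks
    }
}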
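
And here's a sketch of what the domain grouping in question 2 could look
like in practice. It's a hypothetical hash-based assignment of domains to
files, not anything we've actually built: all pages from a given domain
land in the same file, but one very large domain can still make that file
much bigger than the rest, which is the skew concern above.

public class DomainGrouping {
    // Stable assignment: the same domain always maps to the same file index.
    static int fileIndexFor(String domain, int numFiles) {
        return Math.floorMod(domain.hashCode(), numFiles);
    }

    public static void main(String[] args) {
        int numFiles = 1000;
        System.out.println(fileIndexFor("example.com", numFiles));
        System.out.println(fileIndexFor("example.com", numFiles)); // same index both times
    }
}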
Thanks!
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g