Hi all,

I'm looking for input on two questions about the raw data files from the Public Terabyte Dataset project:

1. Target file size. What's the biggest file size that people would want to handle?

E.g. we could generate 1000 chunks of 1GB each, or 100 chunks of 10GB, etc.
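For a rough sense of the trade-off, and purely assuming the files would get written out by a Hadoop MapReduce job (an assumption for illustration, not a statement about how the project actually generates them), the chunk count is basically the number of reducers, so either option comes down to one setting. A quick sketch, with made-up names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Back-of-the-envelope sketch: ~1TB split into N output files gives files of
// roughly 1TB / N each, e.g. N=1000 -> ~1GB, N=100 -> ~10GB.
public class ChunkSizing {
    public static void main(String[] args) throws Exception {
        long datasetBytes = 1L << 40;   // ~1 TB
        int numChunks = 1000;           // target number of output files
        System.out.println("Approx. chunk size: "
            + (datasetBytes / numChunks) / (1L << 30) + " GB");

        // In MapReduce, each reducer writes one part-r-* output file, so the
        // number of reducers sets the chunk count (and thus the chunk size).
        Job job = Job.getInstance(new Configuration(), "ptd-chunking");
        job.setNumReduceTasks(numChunks);
    }
}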

2. Is there any value in grouping the data within files in a particular way?

E.g. we could try to ensure that all data from the same domain goes into the same file.

But that might give individual data files more skew, making it harder to get useful results from processing just a subset of the data. (A rough sketch of one way the grouping could be done follows.)
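To make the grouping idea concrete: if the data were keyed by page URL in a Hadoop job (again, just an assumption for illustration), a custom partitioner could route everything from one domain to the same reducer, and therefore into the same output file. The class name here is hypothetical:

import java.net.URI;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner that sends every record for a given domain to the
// same reducer, so each reducer's output file contains whole domains.
public class DomainPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text urlKey, Text value, int numPartitions) {
        String domain = "";
        try {
            // The key is assumed to be the page URL; its host is the domain.
            String host = new URI(urlKey.toString()).getHost();
            if (host != null) {
                domain = host;
            }
        } catch (Exception e) {
            // Malformed URLs all fall into the same (empty-domain) bucket.
        }
        // Same domain -> same partition -> same output file.
        return (domain.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}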

Thanks!

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



