Hi all,

I'm looking for input on two questions about the raw data files from the Public Terabyte Dataset project:

1. Target file size. What's the biggest file size that people would want to handle?

E.g. we could generate 1000 chunks of 1GB each, or 100 chunks of 10GB, etc.
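For a rough sense of the trade-off, and purely assuming the files would get written out by a Hadoop MapReduce job (an assumption for illustration, not a statement about how the project actually generates them), the chunk count is basically the number of reducers, so either option comes down to one setting. A quick sketch, with made-up names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Back-of-the-envelope sketch: ~1TB split into N output files gives files of
// roughly 1TB / N each, e.g. N=1000 -> ~1GB, N=100 -> ~10GB.
public class ChunkSizing {
    public static void main(String[] args) throws Exception {
        long datasetBytes = 1L << 40;   // ~1 TB
        int numChunks = 1000;           // target number of output files
        System.out.println("Approx. chunk size: "
            + (datasetBytes / numChunks) / (1L << 30) + " GB");

        // In MapReduce, each reducer writes one part-r-* output file, so the
        // number of reducers sets the chunk count (and thus the chunk size).
        Job job = Job.getInstance(new Configuration(), "ptd-chunking");
        job.setNumReduceTasks(numChunks);
    }
}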

2. Is there any value in grouping the data within files in a particular way?

E.g. we could try to ensure that all data from the same domain goes into the same file.

But that might give individual data files more skew, making it harder to get useful results from processing just a subset of the data. (A rough sketch of one way the grouping could be done follows.)
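To make the grouping idea concrete: if the data were keyed by page URL in a Hadoop job (again, just an assumption for illustration), a custom partitioner could route everything from one domain to the same reducer, and therefore into the same output file. The class name here is hypothetical:

import java.net.URI;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner that sends every record for a given domain to the
// same reducer, so each reducer's output file contains whole domains.
public class DomainPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text urlKey, Text value, int numPartitions) {
        String domain = "";
        try {
            // The key is assumed to be the page URL; its host is the domain.
            String host = new URI(urlKey.toString()).getHost();
            if (host != null) {
                domain = host;
            }
        } catch (Exception e) {
            // Malformed URLs all fall into the same (empty-domain) bucket.
        }
        // Same domain -> same partition -> same output file.
        return (domain.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}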

Thanks!

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



