I assume there would also be an issue of which tokenizer to use to create the terms from the text.
And possibly issues around storing separate vectors for (at least) title vs. content?
Anybody have input on either of these?
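For what it's worth, here's a minimal sketch of one shape this could take, assuming a recent Lucene API (the 2.9-era API takes a Version argument in the analyzer constructor and uses TermAttribute rather than CharTermAttribute). The FieldTermCounts class, field names, and sample strings are hypothetical; the idea is just that each field gets its own term-frequency map, which could then back its own sparse vector:

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FieldTermCounts {

    // Tokenize one field's text and count term frequencies.
    static Map<String, Integer> termCounts(Analyzer analyzer, String field, String text)
            throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (TokenStream ts = analyzer.tokenStream(field, new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                counts.merge(term.toString(), 1, Integer::sum);
            }
            ts.end();
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        // One term map per field; each could back its own sparse vector,
        // keeping title and content separate.
        Map<String, Integer> title = termCounts(analyzer, "title", "Public Terabyte Dataset Project");
        Map<String, Integer> content = termCounts(analyzer, "content", "A public crawl of top web domains...");
        System.out.println("title terms: " + title);
        System.out.println("content terms: " + content);
        analyzer.close();
    }
}

Whether those per-field maps stay separate vectors or get folded into one vector with per-field weighting is exactly the open question.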
Thanks,
-- Ken
On Nov 3, 2009, at 10:14am, Jake Mannix wrote:
Well, the minimum size for the IntDoubleVector (which isn't yet in trunk; it's on Ted's patch, which hasn't worked its way in yet) would entail one int and one double per unique term in the document, so that's 12 bytes each. Typical documents have lots of repeat terms, but most terms are smaller than 12 bytes as well... so my guess is that the fraction is probably more than 10% and less than 50%. But I'm sure others around here have more experience producing large vector sets out of the text in Mahout.
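To make that arithmetic concrete, a small sketch (the term and byte counts below are hypothetical; the 12-bytes-per-unique-term figure comes from the int-plus-double layout just described, ignoring JVM object overhead):

public class VectorSizeEstimate {

    // One 4-byte int index plus one 8-byte double weight per unique term,
    // ignoring JVM object and array overhead.
    static long estimatedVectorBytes(int uniqueTerms) {
        return uniqueTerms * (4L + 8L);
    }

    public static void main(String[] args) {
        int uniqueTerms = 400;     // hypothetical document
        int rawTextBytes = 10000;  // hypothetical raw text size
        long vectorBytes = estimatedVectorBytes(uniqueTerms);
        // Prints: 4800 bytes, 48% of raw text -- near the high end of the guess above
        System.out.printf("%d bytes, %.0f%% of raw text%n",
                vectorBytes, 100.0 * vectorBytes / rawTextBytes);
    }
}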
-jake
On Tue, Nov 3, 2009 at 7:49 AM, Ken Krugler <[email protected]> wrote:
On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
Might be of interest to all you Mahouts out there...
http://bixolabs.com/datasets/public-terabyte-dataset-project/
Would be cool to get this converted over to our vector format so that we can cluster, etc.
How much additional space would be required for the vectors, in some optimal compressed format? Say as a percentage of raw text size.
I'm asking because I have some flexibility in the processing and associated metadata I can store as part of the dataset.
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
--------------------------------------------