Re: Sequence file format for Kmeans, LDA, etc.

Grant Ingersoll Fri, 13 Nov 2009 11:54:54 -0800

On Nov 12, 2009, at 8:57 PM, Gregory Lawrence wrote:

> Hi,
> 
> I'm trying to write a map-reduce program that will convert text documents 
> into a format suitable for Mahout's clustering algorithms. From what I can 
> gather, it seems like the output should be a sequence file with a long 
> integer document index (key) and a sparse vector (value) that contains TF (or 
> TFIDF) counts. This sparse vector also has a name that identifies the 
> document.
> 
> Does the long integer document index matter?


No

> I would rather avoid having to set this to something meaningful. Do the 
> numbers have to be unique or contiguous?

The key is ignored in the clustering.

> Does the name of the sparse vector matter?

Yes, as it is part of the equals() method.

> I noticed that it is being set as a string in LuceneIterable.

Right.  You should be able to model after LuceneIterable and the Driver program 
there.

Also, take a look at what the TfIdfDriver does for the classifier stuff.  This 
is an M/R job for converting text into its format.  I think we can abstract that 
to be more general purpose and then move it under the Utils module.  The only 
thing that likely needs to change is whether we output the Writable for the 
classifier or whether we output a Vector.  That is my naive view at this point.

-Grant
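
To make the record layout discussed in this thread concrete, here is a minimal sketch in plain Python. It does not use Mahout or Hadoop: NamedSparseVector, build_dictionary and to_tf_vector are hypothetical names made up for illustration, and the whitespace tokenizer is an assumption. It only shows the shape of the data: an arbitrary integer key, plus a named sparse TF vector whose name takes part in equality.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedSparseVector:
    """Hypothetical stand-in for a named sparse vector (not Mahout's class)."""
    name: str        # document identifier; participates in equality
    entries: tuple   # (term index, count) pairs sorted by index, non-zeros only


def build_dictionary(docs):
    """Assign each distinct term an integer index in first-seen order."""
    dictionary = {}
    for text in docs.values():
        for term in text.lower().split():
            dictionary.setdefault(term, len(dictionary))
    return dictionary


def to_tf_vector(name, text, dictionary):
    """Count term occurrences and keep only the non-zero entries."""
    counts = Counter(dictionary[term] for term in text.lower().split())
    return NamedSparseVector(name, tuple(sorted(counts.items())))


docs = {"doc-a": "the cat sat on the mat", "doc-b": "the dog sat"}
dictionary = build_dictionary(docs)

# The integer key is just a running counter here; per the reply above,
# the clustering ignores it, so it carries no meaning.
records = [
    (key, to_tf_vector(name, text, dictionary))
    for key, (name, text) in enumerate(docs.items())
]

for key, vector in records:
    print(key, vector.name, vector.entries)
```

In a real job the (key, vector) pairs would be written to a sequence file by the reducer; that step is left out here because it depends on the Hadoop and Mahout versions in use.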
