On Nov 12, 2009, at 8:57 PM, Gregory Lawrence wrote:

> Hi,
>
> I'm trying to write a map-reduce program that will convert text documents
> into a format suitable for Mahout's clustering algorithms. From what I can
> gather, it seems like the output should be a sequence file with a long
> integer document index (key) and a sparse vector (value) that contains TF (or
> TFIDF) counts. This sparse vector also has a name that identifies the
> document.
>
> Does the long integer document index matter?

No.

> I would rather avoid having to set this to something meaningful. Do the
> numbers have to be unique or contiguous?

This is ignored in the clustering.

> Does the name of the sparse vector matter?

Yes, as it is part of the equals() method.

> I noticed that it is being set as a string in LuceneIterable.

Right. You should be able to model your job after LuceneIterable and the Driver program there. Also, take a look at what the TfIdfDriver does for the classifier stuff. It is an M/R job that converts text into the classifier's format. I think we can abstract that to be more general purpose and then move it under the Utils module. The only thing that likely needs to change is whether we output the Writable for the classifier or whether we output a Vector. That is my naive view at this point.

-Grant
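
To make the point about the vector name concrete, here is a rough sketch in plain Java of a named term-frequency vector whose equals() includes the name. It deliberately does not use Mahout or Hadoop classes: NamedTfVector and TfVectorizer are hypothetical stand-ins invented for illustration, and the tokenization (lowercase, split on non-alphanumerics) is an assumption, not what LuceneIterable or TfIdfDriver actually do.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical stand-in for a named sparse vector; NOT Mahout's SparseVector. */
final class NamedTfVector {
    private final String name;                                    // identifies the document
    private final Map<Integer, Double> counts = new TreeMap<>();  // term index -> TF count

    NamedTfVector(String name) { this.name = name; }

    String name() { return name; }

    /** Returns the stored count, or 0.0 for entries that were never set. */
    double get(int index) { return counts.getOrDefault(index, 0.0); }

    void increment(int index) { counts.merge(index, 1.0, Double::sum); }

    int nonZeroCount() { return counts.size(); }

    /** The name takes part in equality, mirroring the behaviour described above. */
    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof NamedTfVector)) return false;
        NamedTfVector other = (NamedTfVector) o;
        return name.equals(other.name) && counts.equals(other.counts);
    }

    @Override public int hashCode() { return Objects.hash(name, counts); }
}

/** Hypothetical helper that turns raw text into TF vectors over a shared dictionary. */
final class TfVectorizer {
    private final Map<String, Integer> dictionary = new TreeMap<>(); // term -> index

    NamedTfVector vectorize(String docName, String text) {
        NamedTfVector vector = new NamedTfVector(docName);
        // Assumed tokenization: lowercase, then split on runs of non-alphanumerics.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            Integer index = dictionary.get(token);
            if (index == null) {             // first time this term is seen
                index = dictionary.size();
                dictionary.put(token, index);
            }
            vector.increment(index);
        }
        return vector;
    }

    /** Returns the dictionary index of a term, or -1 if it has not been seen. */
    int indexOf(String term) {
        Integer index = dictionary.get(term);
        return index == null ? -1 : index;
    }
}
```

In this sketch, two documents with identical text but different names produce vectors that are not equal, which is why the name should be set to something that identifies the document. In a real job, the mapper would write each vector as the value of a sequence file record, with any long as the key.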
