On Nov 29, 2009, at 11:25 PM, Ted Dunning wrote:

> On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <[email protected]> wrote:
>
>> ...
>> Currently we do a rather simple process: compute for each document the
>> TFIDF of all terms in the corpus. This is implemented straightforwardly
>> as a two-step map/reduce job. First a map job computes (and serializes
>> to HBase) TF histograms for each document. Then a reduce job computes
>> the IDF of all terms occurring in the corpus and serializes the list of
>> term/IDF pairs to HDFS. Finally, a third map job uses the serialized
>> term/IDF pairs and TF histograms to compute a feature vector for each
>> document. So basically, our feature space is the set of all term/IDF
>> pairs.
>>
>
> You could also use the code in Mahout that allows you to write a Lucene
> index as a sequence of document vectors.
>
> In any case, you should look at the formats already in use by Mahout
> tools so you can match them to what you do.
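
For concreteness, here is a rough single-machine sketch of the three stages
described above. This is not the actual pipeline or Mahout code, just an
illustration: the class and method names are made up, and the conventional
idf(t) = log(N / df(t)) weighting is assumed since the exact formula isn't
given.

import java.util.*;

/**
 * Plain Java illustration of the three-stage TF/IDF pipeline described
 * above: per-document TF histograms, corpus-wide IDF, and TF-IDF feature
 * vectors. In the distributed version each stage is a Hadoop map or reduce
 * phase, with the TF histograms and term/IDF pairs serialized to
 * HBase/HDFS between stages.
 */
public class TfIdfSketch {

  // Stage 1 (map): term-frequency histogram for one document.
  static Map<String, Integer> termFrequencies(String[] tokens) {
    Map<String, Integer> tf = new HashMap<String, Integer>();
    for (String token : tokens) {
      Integer count = tf.get(token);
      tf.put(token, count == null ? 1 : count + 1);
    }
    return tf;
  }

  // Stage 2 (reduce): IDF for every term occurring in the corpus,
  // using idf(t) = log(N / df(t)).
  static Map<String, Double> inverseDocumentFrequencies(List<Map<String, Integer>> corpusTf) {
    Map<String, Integer> documentFrequency = new HashMap<String, Integer>();
    for (Map<String, Integer> tf : corpusTf) {
      for (String term : tf.keySet()) {
        Integer df = documentFrequency.get(term);
        documentFrequency.put(term, df == null ? 1 : df + 1);
      }
    }
    Map<String, Double> idf = new HashMap<String, Double>();
    int numDocs = corpusTf.size();
    for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
      idf.put(entry.getKey(), Math.log((double) numDocs / entry.getValue()));
    }
    return idf;
  }

  // Stage 3 (map): combine a TF histogram with the term/IDF pairs into a
  // feature vector for one document.
  static Map<String, Double> featureVector(Map<String, Integer> tf, Map<String, Double> idf) {
    Map<String, Double> vector = new HashMap<String, Double>();
    for (Map.Entry<String, Integer> entry : tf.entrySet()) {
      vector.put(entry.getKey(), entry.getValue() * idf.get(entry.getKey()));
    }
    return vector;
  }

  public static void main(String[] args) {
    List<Map<String, Integer>> corpusTf = new ArrayList<Map<String, Integer>>();
    corpusTf.add(termFrequencies("the quick brown fox".split(" ")));
    corpusTf.add(termFrequencies("the lazy dog".split(" ")));

    Map<String, Double> idf = inverseDocumentFrequencies(corpusTf);
    for (Map<String, Integer> tf : corpusTf) {
      System.out.println(featureVector(tf, idf));
    }
  }
}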
The Categorization stuff also has an M/R-ready TF/IDF calculator. It would
be nice to see this abstracted from the categorization code and just used
to produce various outputs as needed.
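Something along these lines, for example: a generic output interface that
the M/R TF/IDF job writes through, so the categorization code, a vector
dump to HDFS, or anything else can plug in its own format. The interface
and method names below are hypothetical, not existing Mahout code.

import java.io.IOException;
import java.util.Map;

/**
 * Hypothetical shape for the suggested abstraction: the TF/IDF job stays
 * generic and hands its results to a pluggable output.
 */
public interface TfIdfOutput {

  /** Called once per term with its corpus-wide IDF. */
  void writeIdf(String term, double idf) throws IOException;

  /** Called once per document with its TF-IDF weighted terms. */
  void writeVector(String documentId, Map<String, Double> weights) throws IOException;
}

-Grant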
