On Nov 29, 2009, at 11:25 PM, Ted Dunning wrote:

> On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <[email protected]> wrote:
>
>> ...
>> Currently we do a rather simple process: compute for each document the
>> TFIDF of all terms in the corpus. This is implemented straightforwardly
>> as a two-step map/reduce job. First a map job computes (and serializes
>> to HBase) TF histograms for each document. Then a reduce job computes
>> the IDF of all terms occurring in the corpus and serializes the list of
>> term/IDF pairs to HDFS. Finally, a third map job uses the serialized
>> term/IDF pairs and TF histograms to compute a feature vector for each
>> document. So basically, our feature space is the set of all term/IDF
>> pairs.
>>
>
> You could also use the code in Mahout that allows you to write a Lucene
> index as a sequence of document vectors.
>
> In any case, you should look at the formats already in use by Mahout
> tools so you can match them to what you do.
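
For concreteness, here is a rough single-machine sketch of the three stages
described above. This is not the actual pipeline or Mahout code, just an
illustration: the class and method names are made up, and the conventional
idf(t) = log(N / df(t)) weighting is assumed since the exact formula isn't
given.

import java.util.*;

/**
 * Plain Java illustration of the three-stage TF/IDF pipeline described
 * above: per-document TF histograms, corpus-wide IDF, and TF-IDF feature
 * vectors. In the distributed version each stage is a Hadoop map or reduce
 * phase, with the TF histograms and term/IDF pairs serialized to
 * HBase/HDFS between stages.
 */
public class TfIdfSketch {

  // Stage 1 (map): term-frequency histogram for one document.
  static Map<String, Integer> termFrequencies(String[] tokens) {
    Map<String, Integer> tf = new HashMap<String, Integer>();
    for (String token : tokens) {
      Integer count = tf.get(token);
      tf.put(token, count == null ? 1 : count + 1);
    }
    return tf;
  }

  // Stage 2 (reduce): IDF for every term occurring in the corpus,
  // using idf(t) = log(N / df(t)).
  static Map<String, Double> inverseDocumentFrequencies(List<Map<String, Integer>> corpusTf) {
    Map<String, Integer> documentFrequency = new HashMap<String, Integer>();
    for (Map<String, Integer> tf : corpusTf) {
      for (String term : tf.keySet()) {
        Integer df = documentFrequency.get(term);
        documentFrequency.put(term, df == null ? 1 : df + 1);
      }
    }
    Map<String, Double> idf = new HashMap<String, Double>();
    int numDocs = corpusTf.size();
    for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
      idf.put(entry.getKey(), Math.log((double) numDocs / entry.getValue()));
    }
    return idf;
  }

  // Stage 3 (map): combine a TF histogram with the term/IDF pairs into a
  // feature vector for one document.
  static Map<String, Double> featureVector(Map<String, Integer> tf, Map<String, Double> idf) {
    Map<String, Double> vector = new HashMap<String, Double>();
    for (Map.Entry<String, Integer> entry : tf.entrySet()) {
      vector.put(entry.getKey(), entry.getValue() * idf.get(entry.getKey()));
    }
    return vector;
  }

  public static void main(String[] args) {
    List<Map<String, Integer>> corpusTf = new ArrayList<Map<String, Integer>>();
    corpusTf.add(termFrequencies("the quick brown fox".split(" ")));
    corpusTf.add(termFrequencies("the lazy dog".split(" ")));

    Map<String, Double> idf = inverseDocumentFrequencies(corpusTf);
    for (Map<String, Integer> tf : corpusTf) {
      System.out.println(featureVector(tf, idf));
    }
  }
}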
The Categorization stuff also has an M/R-ready TF/IDF calculator. It would
be nice to see this abstracted from the categorization code and just used
to produce various outputs as needed.
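Something along these lines, for example: a generic output interface that
the M/R TF/IDF job writes through, so the categorization code, a vector
dump to HDFS, or anything else can plug in its own format. The interface
and method names below are hypothetical, not existing Mahout code.

import java.io.IOException;
import java.util.Map;

/**
 * Hypothetical shape for the suggested abstraction: the TF/IDF job stays
 * generic and hands its results to a pluggable output.
 */
public interface TfIdfOutput {

  /** Called once per term with its corpus-wide IDF. */
  void writeIdf(String term, double idf) throws IOException;

  /** Called once per document with its TF-IDF weighted terms. */
  void writeVector(String documentId, Map<String, Double> weights) throws IOException;
}

-Grant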
