On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <[email protected]> wrote:
> ...
> Currently we do a rather simple process: compute for each document
> TFIDF of all terms in the corpus. This is implemented in a
> straightforward way as a two-step map/reduce job. First a map job
> computes (and serializes to HBASE) TF histograms for each document.
> Then a reduce job computes the IDF of all terms occurring in the
> corpus and serializes the list of term/IDF pairs to HDFS. Finally, a
> third map job uses the serialized term/IDF pairs and TF histograms to
> compute a feature vector for each document. So basically, our feature
> space is the set of all term/IDF pairs.

You could also use the code in Mahout that allows you to write a
Lucene index as a sequence of document vectors. In any case, you
should look at the format already in use by the Mahout tools so that
you can match what you produce to it.

> I currently see one major issue with this approach: our feature
> space - and thus our feature vectors - will probably get very large
> when many documents are scanned. This will obviously lead to the
> clustering being very slow.

Not necessarily. If you use sparse vectors, then the dot products
required are pretty fast (there is a small sketch of this at the end
of this message).

> We probably will have to perform some kind of feature reduction
> during the feature extraction to get smaller - but still expressive -
> feature vectors. One idea would e.g. be to perform PCA on the
> "complete" feature vectors in order to identify dimensions that can
> be pruned. However, this might be computationally too expensive.
> Since I am not very experienced in this field, I hoped that some of
> you could share your thoughts or suggestions on the issue.

Read the archives for Mahout-dev. Several developers are working on
SVD decomposition algorithms that run in parallel to do what you need.

In addition, you could make use of some of the random indexing or
simhash methods. At the simplest level, simply assign a
low-dimensional, unit-length random vector to each feature. The vector
for each document is then the IDF-weighted sum of the vectors for its
features. If you use a moderate to high number of dimensions (more
than 50, fewer than 500), this will give you nice fixed-length vectors
that preserve the dot products of the original data pretty accurately.
This is pretty much what a simhash does, except that simhashes go one
step further and binarize the vectors.

A slightly more involved approach is to start from these same initial
random vectors and update them so that the new vector for each term or
other feature is the IDF-weighted sum of the vectors of the terms or
features that occur nearby. Terms that appear in similar contexts then
point in similar directions, which is a simple step toward vectors
with semantic value. This can be done very efficiently in a single
map-reduce pass over your data. You would use these feature vectors
just as you would with the simhash technique. This class of techniques
is known as random indexing.

A third approach is to use random projections to make computation of
the SVD of the document x term matrix tractable. In fact, for your
purposes, you can use a truncated and simplified algorithm that only
computes the singular vectors for the terms, which you would then use
in similar wise to the first two methods. Alternatively, you can use
the truncated algorithm to compute only the singular vectors for the
documents without ever computing term vectors. This is useful if all
you want to do is cluster the documents and you don't really care
about the term vectors. (I have appended rough toy sketches of these
ideas below.)
As I mentioned, there should be a bit of code appearing shortly that does some or all of this. The speed of this approach should be quite high.
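
To make the sparse dot product point concrete, here is a minimal
sketch in plain Java. The Map-based representation is purely for
illustration; in practice you would use Mahout's own sparse vector
classes. The point is that the cost is proportional to the number of
non-zero entries in the smaller vector, not to the size of the term
space.

import java.util.Map;

// Minimal sketch of a sparse dot product.  Cost is proportional to the
// number of non-zero entries in the smaller vector, not to the size of
// the full term space.
public class SparseDot {
  public static double dot(Map<String, Double> a, Map<String, Double> b) {
    if (a.size() > b.size()) {          // iterate over the smaller vector
      Map<String, Double> tmp = a;
      a = b;
      b = tmp;
    }
    double sum = 0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        sum += e.getValue() * other;
      }
    }
    return sum;
  }
}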
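
Here is a rough sketch of the simplest random-vector idea, assuming
each document arrives as a map from term to TF-IDF weight and using
200 dimensions. Seeding a Random with the term's hash is just one
convenient way of giving every feature a fixed random vector without
storing a dictionary; the class and method names are made up for this
example.

import java.util.Map;
import java.util.Random;

// Sketch of the simplest random-vector approach: every term gets a fixed
// low-dimensional random unit vector, and a document vector is the
// IDF-weighted sum of the vectors of its terms.
public class RandomProjection {
  private static final int DIM = 200;   // anywhere between 50 and 500

  // Deterministic random unit vector for a term, seeded by its hash so
  // that no dictionary of vectors has to be stored.
  static double[] termVector(String term) {
    Random rng = new Random(term.hashCode());
    double[] v = new double[DIM];
    double norm = 0;
    for (int i = 0; i < DIM; i++) {
      v[i] = rng.nextGaussian();
      norm += v[i] * v[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < DIM; i++) {
      v[i] /= norm;
    }
    return v;
  }

  // Fixed-length document vector: weighted sum of the document's term
  // vectors, where the weights are the TF-IDF values.
  static double[] documentVector(Map<String, Double> weightedTerms) {
    double[] doc = new double[DIM];
    for (Map.Entry<String, Double> e : weightedTerms.entrySet()) {
      double[] t = termVector(e.getKey());
      double w = e.getValue();
      for (int i = 0; i < DIM; i++) {
        doc[i] += w * t[i];
      }
    }
    return doc;
  }
}

The resulting document vectors are dense but short, and dot products
between them approximate the dot products of the original sparse
vectors, so clustering on them stays cheap.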
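
And a sketch of the random indexing update, written sequentially for
clarity. In the single map-reduce pass I mentioned, the mappers would
emit the weighted index vectors of each term's neighbors keyed by that
term, and the reducers would sum them. The window size of 2 and the
input shapes are illustrative choices, not anything fixed.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the random indexing variant: every term accumulates a context
// vector that is the IDF-weighted sum of the random index vectors of the
// terms that occur near it.
public class RandomIndexing {
  private static final int DIM = 200;
  private static final int WINDOW = 2;  // illustrative context window

  // Deterministic random index vector for a term (same trick as before).
  static double[] indexVector(String term) {
    Random rng = new Random(term.hashCode());
    double[] v = new double[DIM];
    double norm = 0;
    for (int i = 0; i < DIM; i++) {
      v[i] = rng.nextGaussian();
      norm += v[i] * v[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < DIM; i++) {
      v[i] /= norm;
    }
    return v;
  }

  // documents: each document as a list of tokens; idf: term -> IDF weight.
  static Map<String, double[]> contextVectors(List<List<String>> documents,
                                              Map<String, Double> idf) {
    Map<String, double[]> context = new HashMap<String, double[]>();
    for (List<String> doc : documents) {
      for (int i = 0; i < doc.size(); i++) {
        double[] ctx = context.get(doc.get(i));
        if (ctx == null) {
          ctx = new double[DIM];
          context.put(doc.get(i), ctx);
        }
        int lo = Math.max(0, i - WINDOW);
        int hi = Math.min(doc.size() - 1, i + WINDOW);
        for (int j = lo; j <= hi; j++) {
          if (j == i) {
            continue;
          }
          Double w = idf.get(doc.get(j));
          double weight = (w == null) ? 1.0 : w.doubleValue();
          double[] idx = indexVector(doc.get(j));
          for (int k = 0; k < DIM; k++) {
            ctx[k] += weight * idx[k];
          }
        }
      }
    }
    return context;
  }
}

You would then build document vectors from these context vectors just
as in the previous sketch, using a term's context vector in place of
its raw index vector.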
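
Finally, to show the shape of the third idea, here is a toy in-memory
sketch that uses orthogonal (subspace) iteration from a random start
to pull out approximate term singular vectors. It is dense and
sequential purely for illustration, with made-up names and parameters,
and is not the parallel code being worked on for Mahout.

import java.util.Random;

// Toy sketch of the third idea: orthogonal (subspace) iteration from a
// random start to approximate the top-k term singular vectors of a
// term x document matrix.  Dense and sequential purely for illustration;
// the real thing is the parallel, sparse code being worked on for Mahout.
public class TermSvdSketch {

  // a is terms x docs; returns a terms x k matrix whose columns
  // approximate the top-k left (term) singular vectors of a.
  static double[][] termSingularVectors(double[][] a, int k, int iterations) {
    int terms = a.length;
    int docs = a[0].length;
    Random rng = new Random(1);
    double[][] q = new double[terms][k];
    for (int t = 0; t < terms; t++) {
      for (int j = 0; j < k; j++) {
        q[t][j] = rng.nextGaussian();
      }
    }
    orthonormalizeColumns(q);
    for (int it = 0; it < iterations; it++) {
      // z = A * (A^T * q), then re-orthonormalize.
      double[][] atq = new double[docs][k];
      for (int d = 0; d < docs; d++) {
        for (int j = 0; j < k; j++) {
          double s = 0;
          for (int t = 0; t < terms; t++) {
            s += a[t][d] * q[t][j];
          }
          atq[d][j] = s;
        }
      }
      double[][] z = new double[terms][k];
      for (int t = 0; t < terms; t++) {
        for (int j = 0; j < k; j++) {
          double s = 0;
          for (int d = 0; d < docs; d++) {
            s += a[t][d] * atq[d][j];
          }
          z[t][j] = s;
        }
      }
      q = z;
      orthonormalizeColumns(q);
    }
    return q;
  }

  // Modified Gram-Schmidt on the columns of m, in place.
  static void orthonormalizeColumns(double[][] m) {
    int rows = m.length;
    int cols = m[0].length;
    for (int j = 0; j < cols; j++) {
      for (int p = 0; p < j; p++) {
        double dot = 0;
        for (int i = 0; i < rows; i++) {
          dot += m[i][j] * m[i][p];
        }
        for (int i = 0; i < rows; i++) {
          m[i][j] -= dot * m[i][p];
        }
      }
      double norm = 0;
      for (int i = 0; i < rows; i++) {
        norm += m[i][j] * m[i][j];
      }
      norm = Math.sqrt(norm);
      for (int i = 0; i < rows; i++) {
        m[i][j] /= norm;
      }
    }
  }
}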
