Hi Ted, first of all, thanks for your answer.
> In any case, you should look at the format already in use by Mahout tools
> to match those to what you do.
>
>> I currently see one major issue with this approach: our feature space
>> - and thus our feature vectors - will probably get very large when
>> many documents are scanned. This will obviously lead to the clustering
>> being very slow.
>
> Not necessarily. If you use sparse vectors, then the dot products required
> are pretty fast.

Yes, we are already using the Mahout-provided sparse vector implementation.

> In addition, you could make use of some of the random indexing or simhash
> methods. At the simplest level, simply assign a low dimensional unit length
> random vector to each feature. Then the vector for each document would be
> the IDF weighted sum of the vectors for each feature. If you use a moderate
> to high dimensional vector (> 50 dimensions, < 500), this will give you nice
> fixed length vectors that will pretty accurately preserve dot products of
> the original data.

Random indexing sounds like a really interesting technique for feature
extraction in the tuwoc project. I think this approach, combined with more
selective data cleansing and feature selection, should provide us with
reasonable feature vectors. Thanks for the hint! I have put two small
sketches of how I understand the construction (the simple IDF-weighted sum
of random vectors, and the context-based variant) below the quoted mail -
please correct me if I got it wrong.

Cheers,
Max

PS: Do you have further details on the SVD implementation you mentioned
(e.g. a paper detailing the approach)? I once implemented a streaming SVD
algorithm for Cell and would be interested to see the algorithm that was
used for Mahout. Thanks.

On Mon, Nov 30, 2009 at 12:25 AM, Ted Dunning <[email protected]> wrote:
> On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <[email protected]> wrote:
>
>> ...
>> Currently we do a rather simple process: compute for each document the
>> TF-IDF of all terms in the corpus. This is implemented straightforwardly
>> as a two-step map/reduce job. First, a map job computes (and serializes
>> to HBase) TF histograms for each document. Then a reduce job computes
>> the IDF of all terms occurring in the corpus and serializes the list of
>> term/IDF pairs to HDFS. Finally, a third map job uses the serialized
>> term/IDF pairs and TF histograms to compute a feature vector for each
>> document. So basically, our feature space is the set of all term/IDF
>> pairs.
>
> You could also use the code in Mahout that allows you to write a Lucene
> index as a sequence of document vectors.
>
> In any case, you should look at the format already in use by Mahout tools
> to match those to what you do.
>
>> I currently see one major issue with this approach: our feature space
>> - and thus our feature vectors - will probably get very large when
>> many documents are scanned. This will obviously lead to the clustering
>> being very slow.
>
> Not necessarily. If you use sparse vectors, then the dot products required
> are pretty fast.
>
>> We probably will have to perform some kind of feature
>> reduction during the feature extraction to get smaller - but still
>> expressive - feature vectors. One idea would e.g. be to perform PCA on
>> the "complete" feature vectors in order to identify dimensions that
>> can be pruned. However, this might be computationally too expensive.
>> Since I am not very experienced in this field, I hoped that some of
>> you could share your thoughts or suggestions on the issue.
>
> Read the archives for Mahout-dev. Several developers are working on SVD
> decomposition algorithms that run in parallel to do what you need.
> In addition, you could make use of some of the random indexing or simhash
> methods. At the simplest level, simply assign a low dimensional unit length
> random vector to each feature. Then the vector for each document would be
> the IDF weighted sum of the vectors for each feature. If you use a moderate
> to high dimensional vector (> 50 dimensions, < 500), this will give you nice
> fixed length vectors that will pretty accurately preserve dot products of
> the original data. This is pretty much what a sim-hash does, except that
> simhashes go one step further and binarize the vectors.
>
> A slightly more involved approach would be to use these same initial random
> vectors and update them so that the new vectors for each term or other
> feature are the IDF weighted sum of terms or features that occur nearby.
> This makes it so that terms that appear in similar contexts will have
> similar directions, which is a simple step toward getting vectors with
> semantic values. This can be done very efficiently in a single map-reduce
> pass over your data. You would use these feature vectors just like you did
> with the sim-hash technique. This class of techniques is known as random
> indexing.
>
> A third approach is to use random projections to make computation of the SVD
> of the document x term matrix tractable. In fact, for your purposes, you can
> use a truncated and simplified algorithm that only computes the singular
> vectors for the terms, which you would then use in similar wise to the first
> two methods. In contrast, you can also use the truncated algorithm to only
> compute the singular vectors for the documents without ever computing term
> vectors. This is useful if all you want to do is cluster the documents and
> don't really care about the term vectors. As I mentioned, there should be a
> bit of code appearing shortly that does some or all of this. The speed of
> this approach should be quite high.
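
PPS: To make sure I understood the simplest variant correctly, here is a
small sketch in plain Java: one fixed random unit-length vector per term,
the document vector as the IDF-weighted sum of those vectors, and an
optional sign step that binarizes the result into a simhash. The class and
method names are made up (this is not the actual Mahout API), so please
correct me if I misread the construction:

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class RandomIndexingSketch {

    private static final int DIM = 200; // moderate dimensionality, 50 < DIM < 500

    private final Map<String, double[]> indexVectors = new HashMap<String, double[]>();
    private final Random rng = new Random();

    // Fixed random unit-length vector per term, created lazily and then reused.
    private double[] vectorFor(String term) {
        double[] v = indexVectors.get(term);
        if (v == null) {
            v = new double[DIM];
            double norm = 0.0;
            for (int i = 0; i < DIM; i++) {
                v[i] = rng.nextGaussian();
                norm += v[i] * v[i];
            }
            norm = Math.sqrt(norm);
            for (int i = 0; i < DIM; i++) {
                v[i] /= norm;
            }
            indexVectors.put(term, v);
        }
        return v;
    }

    // Document vector = sum over terms of tf * idf * randomVector(term).
    // Assumes the idf map contains every term of the document.
    public double[] documentVector(Map<String, Integer> termFreqs, Map<String, Double> idf) {
        double[] doc = new double[DIM];
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            double weight = e.getValue() * idf.get(e.getKey());
            double[] v = vectorFor(e.getKey());
            for (int i = 0; i < DIM; i++) {
                doc[i] += weight * v[i];
            }
        }
        return doc;
    }

    // Optional simhash step: keep only the sign of each component.
    public boolean[] simhash(double[] doc) {
        boolean[] bits = new boolean[DIM];
        for (int i = 0; i < DIM; i++) {
            bits[i] = doc[i] > 0.0;
        }
        return bits;
    }
}

If I read you correctly, the dot product of two such fixed-length document
vectors should approximate the dot product of the original sparse TF-IDF
vectors, so the clustering code itself would not have to change - only the
vector construction.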

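And here is how I understand the slightly more involved, context-based
variant: the initial random vectors stay fixed, and each term additionally
accumulates the IDF-weighted random vectors of the terms occurring within a
small window around it, so that terms used in similar contexts end up
pointing in similar directions. Again plain Java with invented names, just
to check my understanding - in the real setup this accumulation would be
the single map-reduce pass you mention:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContextRandomIndexing {

    private static final int WINDOW = 2; // neighbours to the left and right

    // Builds a context vector per term by summing the IDF-weighted random
    // index vectors of its neighbours over all documents. Assumes that
    // indexVectors and idf contain an entry for every term.
    public static Map<String, double[]> buildContextVectors(
            List<List<String>> documents,          // tokenized documents
            Map<String, double[]> indexVectors,    // fixed random unit vectors
            Map<String, Double> idf,
            int dim) {

        Map<String, double[]> context = new HashMap<String, double[]>();

        for (List<String> doc : documents) {
            for (int i = 0; i < doc.size(); i++) {
                String term = doc.get(i);
                double[] ctx = context.get(term);
                if (ctx == null) {
                    ctx = new double[dim];
                    context.put(term, ctx);
                }
                int from = Math.max(0, i - WINDOW);
                int to = Math.min(doc.size() - 1, i + WINDOW);
                for (int j = from; j <= to; j++) {
                    if (j == i) {
                        continue;
                    }
                    double[] v = indexVectors.get(doc.get(j));
                    double w = idf.get(doc.get(j));
                    for (int d = 0; d < dim; d++) {
                        ctx[d] += w * v[d];
                    }
                }
            }
        }
        return context;
    }
}

The resulting context vectors would then replace the plain random vectors
when building the document vectors, just as with the first sketch.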