On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <[email protected]> wrote:
> ...
> Currently we do a rather simple process: compute for each document
> TFIDF of all terms in the corpus. This is implemented in a
> straightforward way as a two-step map/reduce job. First a map job
> computes (and serializes to HBASE) TF histograms for each document.
> Then a reduce job computes the IDF of all terms occurring in the
> corpus and serializes the list of term/IDF pairs to HDFS. Finally, a
> third map job uses the serialized term/IDF pairs and TF histograms to
> compute a feature vector for each document. So basically, our feature
> space is the set of all term/IDF pairs.

You could also use the code in Mahout that allows you to write a
Lucene index as a sequence of document vectors. In any case, you
should look at the format already in use by the Mahout tools so that
you can match what you produce to it.

> I currently see one major issue with this approach: our feature
> space - and thus our feature vectors - will probably get very large
> when many documents are scanned. This will obviously lead to the
> clustering being very slow.

Not necessarily. If you use sparse vectors, then the dot products
required are pretty fast (there is a small sketch of this at the end
of this message).

> We probably will have to perform some kind of feature reduction
> during the feature extraction to get smaller - but still expressive -
> feature vectors. One idea would e.g. be to perform PCA on the
> "complete" feature vectors in order to identify dimensions that can
> be pruned. However, this might be computationally too expensive.
> Since I am not very experienced in this field, I hoped that some of
> you could share your thoughts or suggestions on the issue.

Read the archives for Mahout-dev. Several developers are working on
SVD decomposition algorithms that run in parallel to do what you need.

In addition, you could make use of some of the random indexing or
simhash methods. At the simplest level, simply assign a
low-dimensional, unit-length random vector to each feature. The vector
for each document is then the IDF-weighted sum of the vectors for its
features. If you use a moderate to high number of dimensions (more
than 50, fewer than 500), this will give you nice fixed-length vectors
that preserve the dot products of the original data pretty accurately.
This is pretty much what a simhash does, except that simhashes go one
step further and binarize the vectors.

A slightly more involved approach is to start from these same initial
random vectors and update them so that the new vector for each term or
other feature is the IDF-weighted sum of the vectors of the terms or
features that occur nearby. Terms that appear in similar contexts then
point in similar directions, which is a simple step toward vectors
with semantic value. This can be done very efficiently in a single
map-reduce pass over your data. You would use these feature vectors
just as you would with the simhash technique. This class of techniques
is known as random indexing.

A third approach is to use random projections to make computation of
the SVD of the document x term matrix tractable. In fact, for your
purposes, you can use a truncated and simplified algorithm that only
computes the singular vectors for the terms, which you would then use
in similar wise to the first two methods. Alternatively, you can use
the truncated algorithm to compute only the singular vectors for the
documents without ever computing term vectors. This is useful if all
you want to do is cluster the documents and you don't really care
about the term vectors. (I have appended rough toy sketches of these
ideas below.)
As I mentioned, there should be a bit of code appearing shortly that does some or all of this. The speed of this approach should be quite high.
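
To make the sparse dot product point concrete, here is a minimal
sketch in plain Java. The Map-based representation is purely for
illustration; in practice you would use Mahout's own sparse vector
classes. The point is that the cost is proportional to the number of
non-zero entries in the smaller vector, not to the size of the term
space.

import java.util.Map;

// Minimal sketch of a sparse dot product.  Cost is proportional to the
// number of non-zero entries in the smaller vector, not to the size of
// the full term space.
public class SparseDot {
  public static double dot(Map<String, Double> a, Map<String, Double> b) {
    if (a.size() > b.size()) {          // iterate over the smaller vector
      Map<String, Double> tmp = a;
      a = b;
      b = tmp;
    }
    double sum = 0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        sum += e.getValue() * other;
      }
    }
    return sum;
  }
}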
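
Here is a rough sketch of the simplest random-vector idea, assuming
each document arrives as a map from term to TF-IDF weight and using
200 dimensions. Seeding a Random with the term's hash is just one
convenient way of giving every feature a fixed random vector without
storing a dictionary; the class and method names are made up for this
example.

import java.util.Map;
import java.util.Random;

// Sketch of the simplest random-vector approach: every term gets a fixed
// low-dimensional random unit vector, and a document vector is the
// IDF-weighted sum of the vectors of its terms.
public class RandomProjection {
  private static final int DIM = 200;   // anywhere between 50 and 500

  // Deterministic random unit vector for a term, seeded by its hash so
  // that no dictionary of vectors has to be stored.
  static double[] termVector(String term) {
    Random rng = new Random(term.hashCode());
    double[] v = new double[DIM];
    double norm = 0;
    for (int i = 0; i < DIM; i++) {
      v[i] = rng.nextGaussian();
      norm += v[i] * v[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < DIM; i++) {
      v[i] /= norm;
    }
    return v;
  }

  // Fixed-length document vector: weighted sum of the document's term
  // vectors, where the weights are the TF-IDF values.
  static double[] documentVector(Map<String, Double> weightedTerms) {
    double[] doc = new double[DIM];
    for (Map.Entry<String, Double> e : weightedTerms.entrySet()) {
      double[] t = termVector(e.getKey());
      double w = e.getValue();
      for (int i = 0; i < DIM; i++) {
        doc[i] += w * t[i];
      }
    }
    return doc;
  }
}

The resulting document vectors are dense but short, and dot products
between them approximate the dot products of the original sparse
vectors, so clustering on them stays cheap.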
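
And a sketch of the random indexing update, written sequentially for
clarity. In the single map-reduce pass I mentioned, the mappers would
emit the weighted index vectors of each term's neighbors keyed by that
term, and the reducers would sum them. The window size of 2 and the
input shapes are illustrative choices, not anything fixed.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the random indexing variant: every term accumulates a context
// vector that is the IDF-weighted sum of the random index vectors of the
// terms that occur near it.
public class RandomIndexing {
  private static final int DIM = 200;
  private static final int WINDOW = 2;  // illustrative context window

  // Deterministic random index vector for a term (same trick as before).
  static double[] indexVector(String term) {
    Random rng = new Random(term.hashCode());
    double[] v = new double[DIM];
    double norm = 0;
    for (int i = 0; i < DIM; i++) {
      v[i] = rng.nextGaussian();
      norm += v[i] * v[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < DIM; i++) {
      v[i] /= norm;
    }
    return v;
  }

  // documents: each document as a list of tokens; idf: term -> IDF weight.
  static Map<String, double[]> contextVectors(List<List<String>> documents,
                                              Map<String, Double> idf) {
    Map<String, double[]> context = new HashMap<String, double[]>();
    for (List<String> doc : documents) {
      for (int i = 0; i < doc.size(); i++) {
        double[] ctx = context.get(doc.get(i));
        if (ctx == null) {
          ctx = new double[DIM];
          context.put(doc.get(i), ctx);
        }
        int lo = Math.max(0, i - WINDOW);
        int hi = Math.min(doc.size() - 1, i + WINDOW);
        for (int j = lo; j <= hi; j++) {
          if (j == i) {
            continue;
          }
          Double w = idf.get(doc.get(j));
          double weight = (w == null) ? 1.0 : w.doubleValue();
          double[] idx = indexVector(doc.get(j));
          for (int k = 0; k < DIM; k++) {
            ctx[k] += weight * idx[k];
          }
        }
      }
    }
    return context;
  }
}

You would then build document vectors from these context vectors just
as in the previous sketch, using a term's context vector in place of
its raw index vector.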
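
Finally, to show the shape of the third idea, here is a toy in-memory
sketch that uses orthogonal (subspace) iteration from a random start
to pull out approximate term singular vectors. It is dense and
sequential purely for illustration, with made-up names and parameters,
and is not the parallel code being worked on for Mahout.

import java.util.Random;

// Toy sketch of the third idea: orthogonal (subspace) iteration from a
// random start to approximate the top-k term singular vectors of a
// term x document matrix.  Dense and sequential purely for illustration;
// the real thing is the parallel, sparse code being worked on for Mahout.
public class TermSvdSketch {

  // a is terms x docs; returns a terms x k matrix whose columns
  // approximate the top-k left (term) singular vectors of a.
  static double[][] termSingularVectors(double[][] a, int k, int iterations) {
    int terms = a.length;
    int docs = a[0].length;
    Random rng = new Random(1);
    double[][] q = new double[terms][k];
    for (int t = 0; t < terms; t++) {
      for (int j = 0; j < k; j++) {
        q[t][j] = rng.nextGaussian();
      }
    }
    orthonormalizeColumns(q);
    for (int it = 0; it < iterations; it++) {
      // z = A * (A^T * q), then re-orthonormalize.
      double[][] atq = new double[docs][k];
      for (int d = 0; d < docs; d++) {
        for (int j = 0; j < k; j++) {
          double s = 0;
          for (int t = 0; t < terms; t++) {
            s += a[t][d] * q[t][j];
          }
          atq[d][j] = s;
        }
      }
      double[][] z = new double[terms][k];
      for (int t = 0; t < terms; t++) {
        for (int j = 0; j < k; j++) {
          double s = 0;
          for (int d = 0; d < docs; d++) {
            s += a[t][d] * atq[d][j];
          }
          z[t][j] = s;
        }
      }
      q = z;
      orthonormalizeColumns(q);
    }
    return q;
  }

  // Modified Gram-Schmidt on the columns of m, in place.
  static void orthonormalizeColumns(double[][] m) {
    int rows = m.length;
    int cols = m[0].length;
    for (int j = 0; j < cols; j++) {
      for (int p = 0; p < j; p++) {
        double dot = 0;
        for (int i = 0; i < rows; i++) {
          dot += m[i][j] * m[i][p];
        }
        for (int i = 0; i < rows; i++) {
          m[i][j] -= dot * m[i][p];
        }
      }
      double norm = 0;
      for (int i = 0; i < rows; i++) {
        norm += m[i][j] * m[i][j];
      }
      norm = Math.sqrt(norm);
      for (int i = 0; i < rows; i++) {
        m[i][j] /= norm;
      }
    }
  }
}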
