Hi Ted, first of all, thanks for your answer.
> In any case, you should look at the format already in use by Mahout tools
> to match those to what you do.
>
>> I currently see one major issue with this approach: our feature space
>> - and thus our feature vectors - will probably get very large when
>> many documents are scanned. This will obviously lead to the clustering
>> being very slow.
>
> Not necessarily. If you use sparse vectors, then the dot products required
> are pretty fast.

Yes, we are already using the Mahout-provided sparse vector implementation.

> In addition, you could make use of some of the random indexing or simhash
> methods. At the simplest level, simply assign a low dimensional unit length
> random vector to each feature. Then the vector for each document would be
> the IDF weighted sum of the vectors for each feature. If you use a moderate
> to high dimensional vector (> 50 dimensions, < 500), this will give you nice
> fixed length vectors that will pretty accurately preserve dot products of
> the original data.

Random indexing sounds like a really interesting technique for feature
extraction in the tuwoc project. I think this approach, combined with more
selective data cleansing and feature selection, should provide us with
reasonable feature vectors. Thanks for the hint! I have put two small
sketches of how I understand the construction (the simple IDF-weighted sum
of random vectors, and the context-based variant) below the quoted mail -
please correct me if I got it wrong.

Cheers,
Max

PS: Do you have further details on the SVD implementation you mentioned
(e.g. a paper detailing the approach)? I once implemented a streaming SVD
algorithm for Cell and would be interested to see the algorithm that was
used for Mahout. Thanks.

On Mon, Nov 30, 2009 at 12:25 AM, Ted Dunning <[email protected]> wrote:
> On Sun, Nov 29, 2009 at 1:44 PM, Max Heimel <[email protected]> wrote:
>
>> ...
>> Currently we do a rather simple process: compute for each document the
>> TF-IDF of all terms in the corpus. This is implemented straightforwardly
>> as a two-step map/reduce job. First, a map job computes (and serializes
>> to HBase) TF histograms for each document. Then a reduce job computes
>> the IDF of all terms occurring in the corpus and serializes the list of
>> term/IDF pairs to HDFS. Finally, a third map job uses the serialized
>> term/IDF pairs and TF histograms to compute a feature vector for each
>> document. So basically, our feature space is the set of all term/IDF
>> pairs.
>
> You could also use the code in Mahout that allows you to write a Lucene
> index as a sequence of document vectors.
>
> In any case, you should look at the format already in use by Mahout tools
> to match those to what you do.
>
>> I currently see one major issue with this approach: our feature space
>> - and thus our feature vectors - will probably get very large when
>> many documents are scanned. This will obviously lead to the clustering
>> being very slow.
>
> Not necessarily. If you use sparse vectors, then the dot products required
> are pretty fast.
>
>> We probably will have to perform some kind of feature
>> reduction during the feature extraction to get smaller - but still
>> expressive - feature vectors. One idea would e.g. be to perform PCA on
>> the "complete" feature vectors in order to identify dimensions that
>> can be pruned. However, this might be computationally too expensive.
>> Since I am not very experienced in this field, I hoped that some of
>> you could share your thoughts or suggestions on the issue.
>
> Read the archives for Mahout-dev. Several developers are working on SVD
> decomposition algorithms that run in parallel to do what you need.
> In addition, you could make use of some of the random indexing or simhash
> methods. At the simplest level, simply assign a low dimensional unit length
> random vector to each feature. Then the vector for each document would be
> the IDF weighted sum of the vectors for each feature. If you use a moderate
> to high dimensional vector (> 50 dimensions, < 500), this will give you nice
> fixed length vectors that will pretty accurately preserve dot products of
> the original data. This is pretty much what a sim-hash does, except that
> simhashes go one step further and binarize the vectors.
>
> A slightly more involved approach would be to use these same initial random
> vectors and update them so that the new vectors for each term or other
> feature are the IDF weighted sum of terms or features that occur nearby.
> This makes it so that terms that appear in similar contexts will have
> similar directions, which is a simple step toward getting vectors with
> semantic values. This can be done very efficiently in a single map-reduce
> pass over your data. You would use these feature vectors just like you did
> with the sim-hash technique. This class of techniques is known as random
> indexing.
>
> A third approach is to use random projections to make computation of the SVD
> of the document x term matrix tractable. In fact, for your purposes, you can
> use a truncated and simplified algorithm that only computes the singular
> vectors for the terms, which you would then use in similar wise to the first
> two methods. In contrast, you can also use the truncated algorithm to only
> compute the singular vectors for the documents without ever computing term
> vectors. This is useful if all you want to do is cluster the documents and
> don't really care about the term vectors. As I mentioned, there should be a
> bit of code appearing shortly that does some or all of this. The speed of
> this approach should be quite high.
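
PPS: To make sure I understood the simplest variant correctly, here is a
small sketch in plain Java: one fixed random unit-length vector per term,
the document vector as the IDF-weighted sum of those vectors, and an
optional sign step that binarizes the result into a simhash. The class and
method names are made up (this is not the actual Mahout API), so please
correct me if I misread the construction:

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class RandomIndexingSketch {

    private static final int DIM = 200; // moderate dimensionality, 50 < DIM < 500

    private final Map<String, double[]> indexVectors = new HashMap<String, double[]>();
    private final Random rng = new Random();

    // Fixed random unit-length vector per term, created lazily and then reused.
    private double[] vectorFor(String term) {
        double[] v = indexVectors.get(term);
        if (v == null) {
            v = new double[DIM];
            double norm = 0.0;
            for (int i = 0; i < DIM; i++) {
                v[i] = rng.nextGaussian();
                norm += v[i] * v[i];
            }
            norm = Math.sqrt(norm);
            for (int i = 0; i < DIM; i++) {
                v[i] /= norm;
            }
            indexVectors.put(term, v);
        }
        return v;
    }

    // Document vector = sum over terms of tf * idf * randomVector(term).
    // Assumes the idf map contains every term of the document.
    public double[] documentVector(Map<String, Integer> termFreqs, Map<String, Double> idf) {
        double[] doc = new double[DIM];
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            double weight = e.getValue() * idf.get(e.getKey());
            double[] v = vectorFor(e.getKey());
            for (int i = 0; i < DIM; i++) {
                doc[i] += weight * v[i];
            }
        }
        return doc;
    }

    // Optional simhash step: keep only the sign of each component.
    public boolean[] simhash(double[] doc) {
        boolean[] bits = new boolean[DIM];
        for (int i = 0; i < DIM; i++) {
            bits[i] = doc[i] > 0.0;
        }
        return bits;
    }
}

If I read you correctly, the dot product of two such fixed-length document
vectors should approximate the dot product of the original sparse TF-IDF
vectors, so the clustering code itself would not have to change - only the
vector construction.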

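And here is how I understand the slightly more involved, context-based
variant: the initial random vectors stay fixed, and each term additionally
accumulates the IDF-weighted random vectors of the terms occurring within a
small window around it, so that terms used in similar contexts end up
pointing in similar directions. Again plain Java with invented names, just
to check my understanding - in the real setup this accumulation would be
the single map-reduce pass you mention:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ContextRandomIndexing {

    private static final int WINDOW = 2; // neighbours to the left and right

    // Builds a context vector per term by summing the IDF-weighted random
    // index vectors of its neighbours over all documents. Assumes that
    // indexVectors and idf contain an entry for every term.
    public static Map<String, double[]> buildContextVectors(
            List<List<String>> documents,          // tokenized documents
            Map<String, double[]> indexVectors,    // fixed random unit vectors
            Map<String, Double> idf,
            int dim) {

        Map<String, double[]> context = new HashMap<String, double[]>();

        for (List<String> doc : documents) {
            for (int i = 0; i < doc.size(); i++) {
                String term = doc.get(i);
                double[] ctx = context.get(term);
                if (ctx == null) {
                    ctx = new double[dim];
                    context.put(term, ctx);
                }
                int from = Math.max(0, i - WINDOW);
                int to = Math.min(doc.size() - 1, i + WINDOW);
                for (int j = from; j <= to; j++) {
                    if (j == i) {
                        continue;
                    }
                    double[] v = indexVectors.get(doc.get(j));
                    double w = idf.get(doc.get(j));
                    for (int d = 0; d < dim; d++) {
                        ctx[d] += w * v[d];
                    }
                }
            }
        }
        return context;
    }
}

The resulting context vectors would then replace the plain random vectors
when building the document vectors, just as with the first sketch.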