Hi Benson,

I've managed to go from a Lucene index to k-means output with a couple of smaller corpora: one with around 500k items (about 1M total / 100k unique tokens) and another with about half as many items but about 3M total / 300k unique tokens (unigrams in some cases, and a mixture of unigrams and a limited set of bigrams in another). I ended up doing a number of runs with various settings, but somewhat arbitrarily I settled on filtering out terms that appeared in fewer than 8 items. I started with 1000 random centroids and ran 10 iterations. These runs completed overnight on the minuscule 2-machine cluster I use for testing; they probably would have run without a problem without using a cluster at all. I never did go back and check whether they had converged before running all 10 iterations.
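For what it's worth, the document-frequency filtering amounts to something like the sketch below. This is only an illustration, not the code I actually ran; the class name, field name, and threshold are placeholders, and it assumes the Lucene 2.x TermEnum API. It just walks the term dictionary for one field and keeps the terms that occur in at least minDf documents (8 in my runs):

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class MinDfFilter {
      // Collect terms from the given field that occur in at least minDf documents.
      // The field name and threshold are placeholders for this sketch.
      public static Set<String> termsAboveMinDf(IndexReader reader, String field, int minDf)
          throws Exception {
        Set<String> kept = new HashSet<String>();
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
          do {
            Term t = terms.term();
            if (t == null || !field.equals(t.field())) {
              break;                      // walked past the end of this field's terms
            }
            if (terms.docFreq() >= minDf) {
              kept.add(t.text());         // document frequency >= minDf, so keep it
            }
          } while (terms.next());
        } finally {
          terms.close();
        }
        return kept;
      }
    }

The kept set is then what I allow through into the dictionary that gets vectorized.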
In each case I already had the tools to inject item labels and tokens into a Lucene index, so I did not have to use any Mahout-provided tools to set that up. It would be nice to provide a tool that did this, but what general-purpose tokenization pipeline should it use? In my case I was using a processor based on something developed internally for another project. In any event, the Lucene index had a stored field for document labels and a tokenized, indexed field with term vectors stored, from which the tokens were extracted; a rough sketch of that indexing setup is at the bottom of this message. Using o.a.m.utils.vectors.lucene.Driver, I was able to produce vectors suitable as a starting point for k-means.

After running, k-means emits cluster and point data. Everything can be dumped using o.a.m.utils.clustering.ClusterDumper, which takes the clustering output and the dictionary file produced by the lucene.Driver and produces a text file containing what I believe to be a gson(?) representation of the SparseVector representing the centroid of the cluster (I need to verify this), the top terms found in the cluster, and the labels of the items that fell into that cluster. I've opened up the ClusterDumper code and produced something that emits documents and their cluster assignments to support the investigation I'm doing. I have not done an exhaustive amount of validation on the output, but based on what I have done, the results look very promising.

I've tried to run LDA on the same corpora, but haven't met with any success. I'm under the impression that I'm either doing something horribly wrong, or the scaling characteristics of the algorithm are quite different from k-means. I haven't managed to get my head around the algorithm or read the code enough to figure out what the problem could be at this point.

What are the characteristics of the collection of documents you are attempting to cluster?

Drew

On Thu, Dec 17, 2009 at 6:30 PM, Benson Margulies <[email protected]> wrote:
> Gang,
>
> What's the state of the world on clustering a raft of textual
> documents? Are all the pieces in place to start from a directory of
> flat text files, push through Lucene to get the vectors, keep labels
> on the vectors to point back to the files, and run, say, k-means?
>
> I've got enough data here that skimming off the top few unigrams might
> also be advisable.
>
> I tried running this through Weka, and blew it out of virtual memory.
>
> --benson
>
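P.S. The indexing setup mentioned above looks roughly like the sketch below. Treat it as an illustration assuming a Lucene 2.9-style API; the class name, field names ("label", "text"), and analyzer are placeholders rather than the internal pipeline I actually used. The important parts are that the label field is stored and the text field is tokenized, indexed, and has term vectors stored, which is what lucene.Driver extracts tokens from:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class LabelledTextIndexer {
      public static void main(String[] args) throws Exception {
        // Build an index with a stored label field plus a tokenized field
        // that stores term vectors -- the field the vectorizer reads.
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File(args[0])),
            new StandardAnalyzer(Version.LUCENE_CURRENT),
            true,                                   // create a fresh index
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        // Stored, un-analyzed label so it can be mapped back to the source item.
        doc.add(new Field("label", "doc-0001", Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Tokenized, indexed text with term vectors stored.
        doc.add(new Field("text", "raw document text goes here",
            Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();
      }
    }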
