I have a large pile of Hebrew news articles. I want to cluster them so that I can select a disparate subset for initial tagging of a named entity extraction model.
On Thu, Dec 17, 2009 at 10:34 PM, Drew Farris <[email protected]> wrote:
> Hi Benson,
>
> I've managed to go from a lucene index to k-means output with a couple of
> smaller corpora: one around 500k items, about 1M total/100k unique tokens,
> and another with about half that number of items but about 3M total/300k
> unique tokens (unigrams in some cases, and a mixture of unigrams and a
> limited set of bigrams in another). I ended up doing a number of runs with
> various settings, but somewhat arbitrarily I settled on filtering out
> terms that appeared in fewer than 8 items. I started with 1000 random
> centroids and ran 10 iterations. These runs were able to complete
> overnight on the minuscule 2-machine cluster I use for testing; they
> probably would have run without a problem without using a cluster at all.
> I never did go back and check whether they had converged before running
> all 10 iterations.
>
> In each case I already had the tools to inject item labels and tokens
> into a lucene index, so I did not have to use any Mahout-provided tools
> to set that up. It would be nice to provide a tool that did this, but
> what general-purpose tokenization pipeline should be used? In my case I
> was using a processor based on something developed internally for another
> project.
>
> Nevertheless, the lucene index had a stored field for document labels and
> a tokenized, indexed field with term vectors stored, from which the
> tokens were extracted. Using o.a.m.utils.vectors.lucene.Driver, I was
> able to produce vectors suitable as a starting point for k-means.
>
> After running, k-means emits cluster and point data. Everything can be
> dumped using o.a.m.utils.clustering.ClusterDumper, which takes the
> clustering output and the dictionary file produced by the lucene.Driver
> and produces a text file containing what I believe to be a gson(?)
> representation of the SparseVector representing the centroid of the
> cluster (need to verify this), the top terms found in the cluster, and
> the labels of the items that fell into that cluster. I've managed to open
> up the ClusterDumper code and produce something that emits documents and
> their cluster assignments to support the investigation I'm doing.
>
> I have not done an exhaustive amount of validation on the output, but
> based on what I have done, the results look very promising.
>
> I've tried to run LDA on the same corpora, but haven't met with any
> success. I'm under the impression that I'm either doing something
> horribly wrong, or the scaling characteristics of the algorithm are quite
> different from those of k-means. I haven't managed to get my head around
> the algorithm or read the code enough to figure out what the problem
> could be at this point.
>
> What are the characteristics of the collection of documents you are
> attempting to cluster?
>
> Drew
>
> On Thu, Dec 17, 2009 at 6:30 PM, Benson Margulies <[email protected]> wrote:
>> Gang,
>>
>> What's the state of the world on clustering a raft of textual documents?
>> Are all the pieces in place to start from a directory of flat text
>> files, push through Lucene to get the vectors, keep labels on the
>> vectors to point back to the files, and run, say, k-means?
>>
>> I've got enough data here that skimming off the top few unigrams might
>> also be advisable.
>>
>> I tried running this through Weka, and blew it out of virtual memory.
>>
>> --benson
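
For anyone retracing the pipeline Drew describes above, a rough command-line
sketch of the vectorization step follows. The paths and field names (body,
label) are placeholders, and the option spellings are recalled from the
0.2-era utility, so treat them as assumptions and check the driver's --help
output before relying on them.

    # Extract term vectors from an existing Lucene index into Mahout vector
    # format, writing a dictionary file alongside. Paths, field names, and
    # option names here are assumptions, not a verified invocation.
    java -cp "$MAHOUT_CLASSPATH" org.apache.mahout.utils.vectors.lucene.Driver \
      --dir /path/to/lucene/index \
      --field body \
      --idField label \
      --dictOut /path/to/dictionary.txt \
      --output /path/to/vectors \
      --weight TFIDF

The --idField value is what ends up as the document label on each vector,
which is what lets the cluster output point back to the original articles.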
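The clustering run itself (1000 random centroids, 10 iterations) goes through
o.a.m.clustering.kmeans.KMeansDriver. The sketch below carries the same
caveat: the flag names, the job jar name, and the distance-measure class path
are assumptions that shift between Mahout releases.

    # Run k-means over the vectors from the previous step on a Hadoop cluster.
    # --k samples 1000 random centroids; flag names, the job jar name, and the
    # distance measure class are assumptions -- verify with the driver's help.
    hadoop jar mahout-examples-0.2.job \
      org.apache.mahout.clustering.kmeans.KMeansDriver \
      --input /path/to/vectors \
      --output /path/to/kmeans-output \
      --k 1000 \
      --maxIter 10 \
      --distance org.apache.mahout.common.distance.CosineDistanceMeasure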
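Finally, ClusterDumper joins the k-means output back against the dictionary
written by lucene.Driver, producing the readable report of centroid top terms
and member document labels that Drew mentions. Again, the option names and
the output directory layout below are assumptions rather than a verified
invocation.

    # Dump each cluster's top terms and the labels of its member documents to
    # a text file. Option names and the clusters-10/clusteredPoints layout are
    # assumptions borrowed from later Mahout documentation.
    java -cp "$MAHOUT_CLASSPATH" org.apache.mahout.utils.clustering.ClusterDumper \
      --seqFileDir /path/to/kmeans-output/clusters-10 \
      --pointsDir /path/to/kmeans-output/clusteredPoints \
      --dictionary /path/to/dictionary.txt \
      --output /path/to/clusterdump.txt

For the use case in this thread, the dumped labels per cluster are enough to
sample a few documents from each cluster and hand those to annotators, which
is the disparate-subset selection described at the top of the message.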
