Hi Benson,

I've managed to go from a Lucene index to k-means output with a couple of smaller corpora: one with around 500k items (about 1M total / 100k unique tokens) and another with about half as many items but about 3M total / 300k unique tokens (unigrams in some cases, and a mixture of unigrams and a limited set of bigrams in another). I ended up doing a number of runs with various settings, but somewhat arbitrarily I settled on filtering out terms that appeared in fewer than 8 items. I started with 1000 random centroids and ran 10 iterations. These runs completed overnight on the minuscule 2-machine cluster I use for testing; they probably would have run without a problem without using a cluster at all. I never did go back and check whether they had converged before running all 10 iterations.
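For what it's worth, the document-frequency filtering amounts to something like the sketch below. This is only an illustration, not the code I actually ran; the class name, field name, and threshold are placeholders, and it assumes the Lucene 2.x TermEnum API. It just walks the term dictionary for one field and keeps the terms that occur in at least minDf documents (8 in my runs):

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class MinDfFilter {
      // Collect terms from the given field that occur in at least minDf documents.
      // The field name and threshold are placeholders for this sketch.
      public static Set<String> termsAboveMinDf(IndexReader reader, String field, int minDf)
          throws Exception {
        Set<String> kept = new HashSet<String>();
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
          do {
            Term t = terms.term();
            if (t == null || !field.equals(t.field())) {
              break;                      // walked past the end of this field's terms
            }
            if (terms.docFreq() >= minDf) {
              kept.add(t.text());         // document frequency >= minDf, so keep it
            }
          } while (terms.next());
        } finally {
          terms.close();
        }
        return kept;
      }
    }

The kept set is then what I allow through into the dictionary that gets vectorized.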
In each case I already had the tools to inject item labels and tokens into a Lucene index, so I did not have to use any Mahout-provided tools to set that up. It would be nice to provide a tool that did this, but what general-purpose tokenization pipeline should it use? In my case I was using a processor based on something developed internally for another project. In any event, the Lucene index had a stored field for document labels and a tokenized, indexed field with term vectors stored, from which the tokens were extracted; a rough sketch of that indexing setup is at the bottom of this message. Using o.a.m.utils.vectors.lucene.Driver, I was able to produce vectors suitable as a starting point for k-means.

After running, k-means emits cluster and point data. Everything can be dumped using o.a.m.utils.clustering.ClusterDumper, which takes the clustering output and the dictionary file produced by the lucene.Driver and produces a text file containing what I believe to be a gson(?) representation of the SparseVector representing the centroid of the cluster (I need to verify this), the top terms found in the cluster, and the labels of the items that fell into that cluster. I've opened up the ClusterDumper code and produced something that emits documents and their cluster assignments to support the investigation I'm doing. I have not done an exhaustive amount of validation on the output, but based on what I have done, the results look very promising.

I've tried to run LDA on the same corpora, but haven't met with any success. I'm under the impression that I'm either doing something horribly wrong, or the scaling characteristics of the algorithm are quite different from k-means. I haven't managed to get my head around the algorithm or read the code enough to figure out what the problem could be at this point.

What are the characteristics of the collection of documents you are attempting to cluster?

Drew

On Thu, Dec 17, 2009 at 6:30 PM, Benson Margulies <[email protected]> wrote:
> Gang,
>
> What's the state of the world on clustering a raft of textual
> documents? Are all the pieces in place to start from a directory of
> flat text files, push through Lucene to get the vectors, keep labels
> on the vectors to point back to the files, and run, say, k-means?
>
> I've got enough data here that skimming off the top few unigrams might
> also be advisable.
>
> I tried running this through Weka, and blew it out of virtual memory.
>
> --benson
>
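P.S. The indexing setup mentioned above looks roughly like the sketch below. Treat it as an illustration assuming a Lucene 2.9-style API; the class name, field names ("label", "text"), and analyzer are placeholders rather than the internal pipeline I actually used. The important parts are that the label field is stored and the text field is tokenized, indexed, and has term vectors stored, which is what lucene.Driver extracts tokens from:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class LabelledTextIndexer {
      public static void main(String[] args) throws Exception {
        // Build an index with a stored label field plus a tokenized field
        // that stores term vectors -- the field the vectorizer reads.
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File(args[0])),
            new StandardAnalyzer(Version.LUCENE_CURRENT),
            true,                                   // create a fresh index
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        // Stored, un-analyzed label so it can be mapped back to the source item.
        doc.add(new Field("label", "doc-0001", Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Tokenized, indexed text with term vectors stored.
        doc.add(new Field("text", "raw document text goes here",
            Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();
      }
    }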
