On Thu, Dec 31, 2009 at 9:10 AM, Grant Ingersoll <[email protected]> wrote:
> As some of you may know, I'm working on a book (it's a long time coming,
> but I'm getting there) about open source techniques for working with text.
> ...
> Based on my research, it seems people typically divide up the clustering
> space into two approaches: hierarchical and flat/partitioning.

There are a number of ways of dividing the space of all clustering
techniques. Whether the final result is hierarchical is an important
criterion. Other important characteristics include:

- are cluster memberships hard or soft? (k-means = hard; LDA and Dirichlet
= soft; the sketch later in this message makes the difference concrete)

- can the algorithm be viewed in a probabilistic framework? (k-means, LDA,
Dirichlet = yes; agglomerative clustering using nearest neighbors = not so
much)

- is the definition of a cluster abstract enough that a cluster can be
almost any model, or does the algorithm impose stronger limits? (k-means =
symmetric Gaussians with equal variance; Dirichlet = almost any
probabilistic model)

- is the algorithm updateable, or does it have to run from scratch?
(k-means, Dirichlet = yes; agglomerative clustering = not easily)

- is the algorithm non-parametric? (for clustering this pretty much reduces
to whether the number and complexity of clusters can grow without bound as
the amount of data increases; Dirichlet = yes)

- does the algorithm operationally take time linear in the size of the
data? (k-means = yes; LDA = not sure; Dirichlet = pretty much;
agglomerative clustering = no for most algorithms)

- can the algorithm make use of pre-existing knowledge or user adjustments?
(k-means = yes; Dirichlet = yes)

Note that it is pretty easy to adapt several algorithms like k-means to be
hierarchical (see the bisecting sketch at the end of this message).

> In overlaying that knowledge with what we have for techniques in Mahout,
> I'm a bit stumped about where things like LDA and Dirichlet fit into those
> two approaches or is there, perhaps a third that I'm missing? They don't
> seem particularly hierarchical but they don't seem flat either, if that
> makes any sense, given the probabilistic/mixture nature of the algorithms.
> Perhaps I should forgo the traditional division that previous authors have
> taken and just talk about a suite of techniques at a little lower level?
> Thoughts?

I think some of these distinctions are interesting, but it is also very
easy to confuse newcomers with too many of them. A big part of the
confusion comes from the fact that none of these distinctions is
comprehensive, nor is any of them completely clear cut.

> The other thing I'm interested in is people's real world feedback on using
> clustering to solve their text related problems. For instance, what type of
> feature reduction did you do (stopword removal, stemming, etc.)? What
> algorithms worked for you? What didn't work?

My experience has been that almost any reasonable implementation of an
algorithm in the k-means family (which effectively includes LDA and
Dirichlet) will work reasonably well, and that none of them is completely
excellent. Success with text clustering strongly depends on setting
expectations correctly in the interface. If the user expects the clustering
to be useful but not definitive, they tend to find it to be high value. If
the user expects the clustering to be absolutely definitive, they will be
sorely disappointed. For cases where perfection is expected, it is often
important to have good integration between clustering and human edits.
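
To make the hard-versus-soft point (and the feature reduction question)
concrete, here is a toy sketch in plain Java. Everything in it is invented
for illustration, from the documents to the stopword list and class name;
it is not Mahout code and no Mahout API is implied. It drops stopwords,
builds term-frequency vectors, and runs a few iterations of hard-assignment
k-means:

import java.util.*;

// A toy sketch, not Mahout's API: tokenize a few documents, drop a tiny
// stopword list (a stand-in for feature reduction), build term-frequency
// vectors, and run k-means with *hard* assignments. A soft algorithm
// (LDA, Dirichlet) would keep a probability per document/cluster pair
// instead of a single label.
public class TinyTextKMeans {

  static final Set<String> STOPWORDS =
      new HashSet<>(Arrays.asList("the", "a", "of", "and", "to", "is", "on"));

  public static void main(String[] args) {
    String[] docs = {
        "the cat sat on the mat",
        "a cat and a dog",
        "the stock market fell",
        "stock prices fell and the market is down"
    };

    // Tokenize, drop stopwords, and collect term frequencies.
    List<String> vocab = new ArrayList<>();
    List<Map<String, Integer>> counts = new ArrayList<>();
    for (String doc : docs) {
      Map<String, Integer> tf = new HashMap<>();
      for (String tok : doc.split("\\s+")) {
        if (STOPWORDS.contains(tok)) continue;      // feature reduction
        tf.merge(tok, 1, Integer::sum);
        if (!vocab.contains(tok)) vocab.add(tok);
      }
      counts.add(tf);
    }

    // Dense term-frequency vectors (a real system would use sparse TF-IDF).
    double[][] x = new double[docs.length][vocab.size()];
    for (int i = 0; i < docs.length; i++)
      for (int j = 0; j < vocab.size(); j++)
        x[i][j] = counts.get(i).getOrDefault(vocab.get(j), 0);

    int k = 2;
    double[][] centroids = { x[0].clone(), x[2].clone() }; // seed from data
    int[] assign = new int[docs.length];

    for (int iter = 0; iter < 10; iter++) {
      // Hard assignment: each document belongs to exactly one cluster.
      for (int i = 0; i < x.length; i++) {
        int best = 0;
        for (int c = 1; c < k; c++)
          if (sqDist(x[i], centroids[c]) < sqDist(x[i], centroids[best]))
            best = c;
        assign[i] = best;
      }
      // Recompute each centroid as the mean of its members.
      for (int c = 0; c < k; c++) {
        double[] m = new double[vocab.size()];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
          if (assign[i] != c) continue;
          n++;
          for (int j = 0; j < m.length; j++) m[j] += x[i][j];
        }
        if (n > 0) {
          for (int j = 0; j < m.length; j++) m[j] /= n;
          centroids[c] = m;
        }
      }
    }

    for (int i = 0; i < docs.length; i++)
      System.out.println("cluster " + assign[i] + ": " + docs[i]);
  }

  // Squared Euclidean distance; real text clustering would more often use
  // cosine distance on normalized vectors.
  static double sqDist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
  }
}

The thing to notice is the assign[] array: k-means commits each document to
exactly one cluster, where a soft algorithm like LDA or Dirichlet would
carry a probability distribution over clusters for each document.
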
The Carrot2 implementation is among the best I have used for clustering
small batches of results, such as search result lists.
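
As an aside on the note above that k-means adapts easily to hierarchy, the
usual trick is bisecting k-means: split the data in two with ordinary
2-means, then recursively re-split each piece. A minimal sketch on made-up
one-dimensional data (again just an illustration, not Mahout code):

import java.util.*;

// Bisecting k-means: a flat 2-means step applied recursively produces a
// cluster tree. Toy one-dimensional data keeps the example short.
public class BisectingSketch {

  public static void main(String[] args) {
    List<double[]> points = new ArrayList<>();
    for (double v : new double[] {1, 2, 3, 20, 21, 22, 50, 51}) {
      points.add(new double[] {v});
    }
    split(points, "");
  }

  // Recursively bisect, printing the cluster tree as an indented outline.
  static void split(List<double[]> cluster, String indent) {
    System.out.println(indent + cluster.size() + " points");
    if (cluster.size() <= 2) return;
    List<double[]> left = new ArrayList<>(), right = new ArrayList<>();
    double[] c1 = cluster.get(0);                      // crude seeds
    double[] c2 = cluster.get(cluster.size() - 1);
    for (int iter = 0; iter < 5; iter++) {
      left.clear(); right.clear();
      for (double[] p : cluster) {
        (sqDist(p, c1) <= sqDist(p, c2) ? left : right).add(p);
      }
      if (left.isEmpty() || right.isEmpty()) return;   // degenerate split
      c1 = mean(left);
      c2 = mean(right);
    }
    split(left, indent + "  ");
    split(right, indent + "  ");
  }

  static double sqDist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
  }

  static double[] mean(List<double[]> pts) {
    double[] m = new double[pts.get(0).length];
    for (double[] p : pts) for (int i = 0; i < m.length; i++) m[i] += p[i];
    for (int i = 0; i < m.length; i++) m[i] /= pts.size();
    return m;
  }
}

Each level of the printed outline is one level of the resulting cluster
tree, so a nominally flat partitioning algorithm ends up producing a
hierarchy.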
