On Thu, Dec 31, 2009 at 9:10 AM, Grant Ingersoll <[email protected]> wrote:
> As some of you may know, I'm working on a book (it's a long time coming,
> but I'm getting there) about open source techniques for working with text.
> ...
> Based on my research, it seems people typically divide up the clustering
> space into two approaches: hierarchical and flat/partitioning.

There are a number of ways of dividing the space of all clustering
techniques. Whether the final result is hierarchical is an important
criterion. Other important characteristics include:

- are cluster memberships hard or soft? (k-means = hard; LDA and Dirichlet
= soft; the sketch later in this message makes the difference concrete)

- can the algorithm be viewed in a probabilistic framework? (k-means, LDA,
Dirichlet = yes; agglomerative clustering using nearest neighbors = not so
much)

- is the definition of a cluster abstract enough that a cluster can be
almost any model, or does the algorithm impose stronger limits? (k-means =
symmetric Gaussians with equal variance; Dirichlet = almost any
probabilistic model)

- is the algorithm updateable, or does it have to run from scratch?
(k-means, Dirichlet = yes; agglomerative clustering = not easily)

- is the algorithm non-parametric? (for clustering this pretty much reduces
to whether the number and complexity of clusters can grow without bound as
the amount of data increases; Dirichlet = yes)

- does the algorithm operationally take time linear in the size of the
data? (k-means = yes; LDA = not sure; Dirichlet = pretty much;
agglomerative clustering = no for most algorithms)

- can the algorithm make use of pre-existing knowledge or user adjustments?
(k-means = yes; Dirichlet = yes)

Note that it is pretty easy to adapt several algorithms like k-means to be
hierarchical (see the bisecting sketch at the end of this message).

> In overlaying that knowledge with what we have for techniques in Mahout,
> I'm a bit stumped about where things like LDA and Dirichlet fit into those
> two approaches or is there, perhaps a third that I'm missing? They don't
> seem particularly hierarchical but they don't seem flat either, if that
> makes any sense, given the probabilistic/mixture nature of the algorithms.
> Perhaps I should forgo the traditional division that previous authors have
> taken and just talk about a suite of techniques at a little lower level?
> Thoughts?

I think some of these distinctions are interesting, but it is also very
easy to confuse newcomers with too many of them. A big part of the
confusion comes from the fact that none of these distinctions is
comprehensive, nor is any of them completely clear cut.

> The other thing I'm interested in is people's real world feedback on using
> clustering to solve their text related problems. For instance, what type of
> feature reduction did you do (stopword removal, stemming, etc.)? What
> algorithms worked for you? What didn't work?

My experience has been that almost any reasonable implementation of an
algorithm in the k-means family (which effectively includes LDA and
Dirichlet) will work reasonably well, and that none of them is completely
excellent. Success with text clustering strongly depends on setting
expectations correctly in the interface. If the user expects the clustering
to be useful but not definitive, they tend to find it to be high value. If
the user expects the clustering to be absolutely definitive, they will be
sorely disappointed. For cases where perfection is expected, it is often
important to have good integration between clustering and human edits.
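
To make the hard-versus-soft point (and the feature reduction question)
concrete, here is a toy sketch in plain Java. Everything in it is invented
for illustration, from the documents to the stopword list and class name;
it is not Mahout code and no Mahout API is implied. It drops stopwords,
builds term-frequency vectors, and runs a few iterations of hard-assignment
k-means:

import java.util.*;

// A toy sketch, not Mahout's API: tokenize a few documents, drop a tiny
// stopword list (a stand-in for feature reduction), build term-frequency
// vectors, and run k-means with *hard* assignments. A soft algorithm
// (LDA, Dirichlet) would keep a probability per document/cluster pair
// instead of a single label.
public class TinyTextKMeans {

  static final Set<String> STOPWORDS =
      new HashSet<>(Arrays.asList("the", "a", "of", "and", "to", "is", "on"));

  public static void main(String[] args) {
    String[] docs = {
        "the cat sat on the mat",
        "a cat and a dog",
        "the stock market fell",
        "stock prices fell and the market is down"
    };

    // Tokenize, drop stopwords, and collect term frequencies.
    List<String> vocab = new ArrayList<>();
    List<Map<String, Integer>> counts = new ArrayList<>();
    for (String doc : docs) {
      Map<String, Integer> tf = new HashMap<>();
      for (String tok : doc.split("\\s+")) {
        if (STOPWORDS.contains(tok)) continue;      // feature reduction
        tf.merge(tok, 1, Integer::sum);
        if (!vocab.contains(tok)) vocab.add(tok);
      }
      counts.add(tf);
    }

    // Dense term-frequency vectors (a real system would use sparse TF-IDF).
    double[][] x = new double[docs.length][vocab.size()];
    for (int i = 0; i < docs.length; i++)
      for (int j = 0; j < vocab.size(); j++)
        x[i][j] = counts.get(i).getOrDefault(vocab.get(j), 0);

    int k = 2;
    double[][] centroids = { x[0].clone(), x[2].clone() }; // seed from data
    int[] assign = new int[docs.length];

    for (int iter = 0; iter < 10; iter++) {
      // Hard assignment: each document belongs to exactly one cluster.
      for (int i = 0; i < x.length; i++) {
        int best = 0;
        for (int c = 1; c < k; c++)
          if (sqDist(x[i], centroids[c]) < sqDist(x[i], centroids[best]))
            best = c;
        assign[i] = best;
      }
      // Recompute each centroid as the mean of its members.
      for (int c = 0; c < k; c++) {
        double[] m = new double[vocab.size()];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
          if (assign[i] != c) continue;
          n++;
          for (int j = 0; j < m.length; j++) m[j] += x[i][j];
        }
        if (n > 0) {
          for (int j = 0; j < m.length; j++) m[j] /= n;
          centroids[c] = m;
        }
      }
    }

    for (int i = 0; i < docs.length; i++)
      System.out.println("cluster " + assign[i] + ": " + docs[i]);
  }

  // Squared Euclidean distance; real text clustering would more often use
  // cosine distance on normalized vectors.
  static double sqDist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
  }
}

The thing to notice is the assign[] array: k-means commits each document to
exactly one cluster, where a soft algorithm like LDA or Dirichlet would
carry a probability distribution over clusters for each document.
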
The Carrot2 implementation is among the best I have used for clustering
small batches of results, such as search result lists.
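
As an aside on the note above that k-means adapts easily to hierarchy, the
usual trick is bisecting k-means: split the data in two with ordinary
2-means, then recursively re-split each piece. A minimal sketch on made-up
one-dimensional data (again just an illustration, not Mahout code):

import java.util.*;

// Bisecting k-means: a flat 2-means step applied recursively produces a
// cluster tree. Toy one-dimensional data keeps the example short.
public class BisectingSketch {

  public static void main(String[] args) {
    List<double[]> points = new ArrayList<>();
    for (double v : new double[] {1, 2, 3, 20, 21, 22, 50, 51}) {
      points.add(new double[] {v});
    }
    split(points, "");
  }

  // Recursively bisect, printing the cluster tree as an indented outline.
  static void split(List<double[]> cluster, String indent) {
    System.out.println(indent + cluster.size() + " points");
    if (cluster.size() <= 2) return;
    List<double[]> left = new ArrayList<>(), right = new ArrayList<>();
    double[] c1 = cluster.get(0);                      // crude seeds
    double[] c2 = cluster.get(cluster.size() - 1);
    for (int iter = 0; iter < 5; iter++) {
      left.clear(); right.clear();
      for (double[] p : cluster) {
        (sqDist(p, c1) <= sqDist(p, c2) ? left : right).add(p);
      }
      if (left.isEmpty() || right.isEmpty()) return;   // degenerate split
      c1 = mean(left);
      c2 = mean(right);
    }
    split(left, indent + "  ");
    split(right, indent + "  ");
  }

  static double sqDist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
  }

  static double[] mean(List<double[]> pts) {
    double[] m = new double[pts.get(0).length];
    for (double[] p : pts) for (int i = 0; i < m.length; i++) m[i] += p[i];
    for (int i = 0; i < m.length; i++) m[i] /= pts.size();
    return m;
  }
}

Each level of the printed outline is one level of the resulting cluster
tree, so a nominally flat partitioning algorithm ends up producing a
hierarchy.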
