As some of you may know, I'm working on a book (it's a long time coming, but 
I'm getting there) about open source techniques for working with text.  One of 
my chapters is on clustering and in it, I want to talk about generic clustering 
approaches and then show concrete examples of them in action.   I've got the 
concrete side of it down.

Based on my research, it seems people typically divide up the clustering space 
into two approaches: hierarchical and flat/partitioning.  In overlaying that 
knowledge with the techniques we have in Mahout, I'm a bit stumped about 
where things like LDA and Dirichlet fit into those two approaches.  Or is there 
perhaps a third that I'm missing?  They don't seem particularly hierarchical 
but they don't seem flat either, if that makes any sense, given the 
probabilistic/mixture nature of the algorithms.  Perhaps I should forgo the 
traditional division that previous authors have taken and just talk about a 
suite of techniques at a little lower level?  Thoughts?
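
To make the distinction I'm struggling with concrete, here is a rough sketch 
in plain Python (not Mahout code).  The function names, the two toy centers, 
and the equal-weight spherical Gaussians with a shared sigma are all 
assumptions made up for illustration; LDA and Dirichlet clustering are more 
involved than this, but they share the idea of a per-cluster probability 
rather than a single label.

```python
import math

def hard_assign(point, centers):
    """Flat/partitioning view: the point gets exactly one cluster,
    the index of the nearest center (k-means style)."""
    dists = [math.dist(point, c) for c in centers]
    return dists.index(min(dists))

def soft_assign(point, centers, sigma=1.0):
    """Mixture view: the point gets a probability for every cluster.
    Assumes equal-weight spherical Gaussians with a shared sigma,
    which is an illustrative simplification."""
    weights = [
        math.exp(-math.dist(point, c) ** 2 / (2 * sigma ** 2))
        for c in centers
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Two hypothetical cluster centers and one point lying between them.
centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.5, 0.0)

hard = hard_assign(point, centers)   # a single cluster index
soft = soft_assign(point, centers)   # one probability per cluster
print(hard, soft)
```

The hard version throws away how close the point was to the other cluster, 
while the soft version keeps it, which is why the mixture-style algorithms 
don't feel quite "flat" to me.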

The other thing I'm interested in is people's real-world feedback on using 
clustering to solve their text-related problems.  For instance, what type of 
feature reduction did you do (stopword removal, stemming, etc.)?  What 
algorithms worked for you?  What didn't work?  Any and all insight is welcome, 
and I don't particularly care if it is Mahout-specific (for instance, part of 
the chapter is about search result clustering using Carrot2, so Mahout isn't 
applicable there).

Thanks in advance and Happy New Year,
Grant
