On Jan 2, 2010, at 2:15 AM, Shashikant Kore wrote:

> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[email protected]> wrote:
>>
>> The other thing I'm interested in is people's real world feedback on using
>> clustering to solve their text related problems.
>> For instance, what type of feature reduction did you do (stopword removal,
>> stemming, etc.)? What algorithms worked for you?
>> What didn't work? Any and all insight is welcome and I don't particularly
>> care if it is Mahout specific (for instance, part of
>> the chapter is about search result clustering using Carrot2 and so Mahout
>> isn't applicable)
>>
>
> Using vector normalization like L2 norm helped quite a bit.

As I recall, it is important that the choice of norm aligns with the choice of distance measure, as well as with the data source (see http://www.lucidimagination.com/search/document/34ffc2a83a71a055/centroid_calculations_with_sparse_vectors and http://www.lucidimagination.com/search/document/34ffc2a83a71a055/centroid_calculations_with_sparse_vectors#3d8310376b6cdf6b).
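
To illustrate what I mean by the norm and the distance measure lining up, here is a rough sketch in plain Python (not Mahout code; the vectors and helper names are made up for the example). For L2-normalized vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), so Euclidean-based clustering on unit vectors ranks neighbours the same way cosine distance does. Pairing a different norm with a different measure gives no such guarantee.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def cosine_distance(a, b):
    """1 - cosine similarity, computed on the raw (unnormalized) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical term-weight vectors for two documents.
doc_a = [3.0, 0.0, 4.0]
doc_b = [1.0, 2.0, 2.0]

unit_a = l2_normalize(doc_a)
unit_b = l2_normalize(doc_b)

# On unit vectors: ||a - b||^2 == 2 * (1 - cos(a, b))
lhs = squared_euclidean(unit_a, unit_b)
rhs = 2.0 * cosine_distance(doc_a, doc_b)
print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding.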
