On Jan 2, 2010, at 2:15 AM, Shashikant Kore wrote:

> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[email protected]> wrote:
>>
>> The other thing I'm interested in is people's real-world feedback on
>> using clustering to solve their text-related problems. For instance,
>> what type of feature reduction did you do (stopword removal, stemming,
>> etc.)? What algorithms worked for you? What didn't work? Any and all
>> insight is welcome, and I don't particularly care if it is
>> Mahout-specific (for instance, part of the chapter is about search
>> result clustering using Carrot2, so Mahout isn't applicable).
>>
>
> Let me start by saying Mahout works great for us. We can run k-means
> on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
> single host.
>
> Using vector normalization like the L2 norm helped quite a bit. Thanks
> to Ted for this suggestion. In text clustering, you have lots of small
> documents, which results in very sparse vectors (a total of 100K
> features, with each vector having 200 features). Using vanilla TF-IDF
> weights doesn't work as well.
>
> Even if we don't do explicit stopword removal, the document-count
> thresholds accomplish that in a better fashion. If you exclude the
> features which are extremely common (say, in more than 40% of
> documents) or extremely rare (say, in fewer than 50 documents in a
> corpus of 100K docs), you have a meaningful set of features. The
> current k-means already accepts these parameters.
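To make that recipe concrete, here is a small illustrative sketch of
df-threshold pruning followed by L2-normalized TF-IDF weighting. This is
hypothetical standalone Java, not Mahout's actual vectorization code;
all class and method names below are made up:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch, not Mahout code: prune features by document
 *  frequency, then build L2-normalized TF-IDF vectors. */
public class SparseTfIdfSketch {

  /** Keep only terms whose document frequency falls inside
   *  [minDf, maxDfFraction * numDocs], e.g. minDf=50, maxDfFraction=0.40. */
  static Map<String, Integer> prunedVocabulary(Map<String, Integer> docFreq,
                                               int numDocs, int minDf,
                                               double maxDfFraction) {
    Map<String, Integer> vocab = new HashMap<>();
    int maxDf = (int) (maxDfFraction * numDocs);
    int index = 0;
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      int df = e.getValue();
      if (df >= minDf && df <= maxDf) {
        vocab.put(e.getKey(), index++);       // term -> feature index
      }
    }
    return vocab;
  }

  /** TF-IDF weights for one document, L2-normalized so that distances
   *  between short, sparse documents behave sensibly. */
  static Map<Integer, Double> vectorize(Map<String, Integer> termCounts,
                                        Map<String, Integer> vocab,
                                        Map<String, Integer> docFreq,
                                        int numDocs) {
    Map<Integer, Double> vec = new HashMap<>();
    double sumSq = 0.0;
    for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
      Integer idx = vocab.get(e.getKey());
      if (idx == null) continue;              // feature was pruned
      double idf = Math.log((double) numDocs / docFreq.get(e.getKey()));
      double w = e.getValue() * idf;          // tf * idf
      vec.put(idx, w);
      sumSq += w * w;
    }
    double norm = Math.sqrt(sumSq);
    if (norm > 0) {
      for (Map.Entry<Integer, Double> e : vec.entrySet()) {
        e.setValue(e.getValue() / norm);      // L2 normalization
      }
    }
    return vec;
  }
}
```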
You mean the Lucene Driver that creates the vectors, right?

>
> Stemming can be used for feature reduction, but it has a minor issue.
> When you want to find the prominent features of the resulting cluster
> centroid, the feature may not be meaningful. For example, if a
> prominent feature is "beautiful", when you get it back, you will get
> "beauti." Ouch.

Right, but this is easily handled via something like Lucene's
highlighter functionality. I bet it could be made to work on Mahout's
vectors (+ a dictionary) fairly easily.

>
> I tried fuzzy k-means for soft clustering, but I didn't get good
> results. Maybe the corpus was the issue.
>
> One observation about the clustering process is that it is geared, by
> accident or by design, towards batch processing. There is no support
> for real-time clustering. There needs to be glue which ties all the
> components together to make the process seamless. I suppose someone
> in need of this feature will contribute it to Mahout.

Right. This should be pretty easy to remedy, though. One could simply
use the previous results as the --clusters option, right?

>
> Grant, if I recall more, I will mail it to you.

Great! Thank you.

>
> --shashi

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
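As an illustration of the dictionary idea for mapping stems back to
readable terms when labeling centroids: track the most frequent surface
form seen for each stem while vectorizing, then look it up at display
time. This is a hypothetical Java sketch, not Lucene's highlighter or
Mahout code; the class and method names are made up:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: map stemmed features back to a readable surface form. */
public class StemDictionarySketch {
  // stem -> (surface form -> count)
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();

  /** Call once per token while building vectors. */
  public void record(String stem, String surfaceForm) {
    counts.computeIfAbsent(stem, k -> new HashMap<>())
          .merge(surfaceForm, 1, Integer::sum);
  }

  /** "beauti" -> "beautiful", assuming that was the commonest source. */
  public String displayForm(String stem) {
    Map<String, Integer> forms = counts.get(stem);
    if (forms == null) return stem;
    return forms.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(stem);
  }
}
```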
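And to sketch the reuse-previous-results idea: seeding each new k-means
run with the prior run's final centroids is what passing the previous
output as the --clusters option amounts to. The interface below is a
hypothetical stand-in, not Mahout's actual driver API:

```java
import java.util.List;

/** Sketch of incremental/periodic clustering: seed each new run with
 *  the previous run's final centroids. KMeans is a hypothetical
 *  stand-in, not Mahout's driver. */
public class IncrementalClusteringSketch {

  interface KMeans {
    /** Runs k-means to convergence; returns the final centroids. */
    List<double[]> run(List<double[]> docs, List<double[]> seeds, int maxIter);
  }

  static List<double[]> recluster(KMeans kmeans,
                                  List<double[]> previousCentroids,
                                  List<double[]> newDocs) {
    // Seeding with the previous centroids keeps cluster identities
    // stable across runs and typically converges in fewer iterations
    // than starting from random seeds.
    return kmeans.run(newDocs, previousCentroids, 10);
  }
}
```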
