On 1/2/10 6:15 PM, Shashikant Kore wrote:
On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll<[email protected]>  wrote:
The other thing I'm interested in is people's real-world feedback on
using clustering to solve their text-related problems. For instance,
what type of feature reduction did you do (stopword removal, stemming,
etc.)? What algorithms worked for you? What didn't work? Any and all
insight is welcome and I don't particularly care if it is
Mahout-specific (for instance, part of the chapter is about search
result clustering using Carrot2 and so Mahout isn't applicable).

Let me start by saying Mahout works great for us. We can run k-means
on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
single host.

Using vector normalization like the L2 norm helped quite a bit. Thanks
to Ted for this suggestion. In text clustering, you have lots of small
documents. This results in very sparse vectors (a total of 100K
features, with each vector having 200 features). Using vanilla TFIDF
weights doesn't work as nicely.
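
For readers unfamiliar with the term, L2 normalization scales each
document vector to unit Euclidean length, so short and long documents
become comparable. Below is a minimal sketch, not Mahout code: it
assumes a sparse vector is held as a plain Python dict of feature to
weight, and the name l2_normalize is made up for illustration.

```python
import math


def l2_normalize(vec):
    """Return a copy of a sparse vector scaled to unit Euclidean length.

    `vec` maps feature -> weight (an assumed representation, not Mahout's).
    """
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        # An all-zero vector has no direction; return it unchanged.
        return dict(vec)
    return {feature: w / norm for feature, w in vec.items()}


# Hypothetical TFIDF weights for one short document.
doc = {"mahout": 3.0, "cluster": 4.0}
unit = l2_normalize(doc)
```

After normalization the squared weights of every non-empty vector sum
to 1, so document length no longer dominates the distance measure.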

I'm not sure what L2 norm is, but wouldn't the frequent pattern mining
feature (from MAHOUT-157) help here? I was hoping to use it for feature
reduction.

Even if we don't do explicit stop word removal, the threshold values
for document count do that in a better fashion. If you exclude the
features which are extremely common (say, in more than 40% of
documents) or extremely rare (say, in fewer than 50 documents in a
corpus of 100K docs), you have a meaningful set of features. The
current K-Means already accepts these parameters.
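
The document-count thresholds described above can be sketched roughly
as follows. This is an illustration only, not Mahout's implementation:
the function name and the min_df / max_df_ratio parameters are
hypothetical, and documents are assumed to be already tokenized lists
of strings.

```python
from collections import Counter


def prune_by_doc_frequency(docs, min_df=50, max_df_ratio=0.4):
    """Drop tokens that appear in too few or too many documents.

    min_df and max_df_ratio are hypothetical names; the defaults mirror
    the example figures from the mail (50 documents, 40%).
    """
    doc_freq = Counter()
    for tokens in docs:
        # Count each token once per document, however often it repeats.
        doc_freq.update(set(tokens))

    max_df = max_df_ratio * len(docs)
    keep = {t for t, n in doc_freq.items() if min_df <= n <= max_df}
    return [[t for t in tokens if t in keep] for tokens in docs]


# Tiny corpus, so the thresholds are scaled down for the demo.
corpus = [["the", "cat"], ["the", "dog"], ["the", "cat"], ["the", "bird"]]
pruned = prune_by_doc_frequency(corpus, min_df=2, max_df_ratio=0.5)
```

In this toy corpus "the" is dropped for being too common and "dog" and
"bird" for being too rare, which is the stop-word-like effect mentioned
above without needing an explicit stop word list.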

Stemming can be used for feature reduction, but it has a minor issue.
When you want to find out the prominent features of the resulting
cluster centroid, the stemmed feature may not be meaningful. For
example, if a prominent feature is "beautiful", when you get it back,
you will get "beauti." Ouch.
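
One possible workaround, which is not something described in this
thread, is to remember the most frequent surface form seen for each
stem and show that form when reporting centroid features. A minimal
sketch, assuming a stemmer is passed in as a function; toy_stem below
is a deliberately crude stand-in, not a real stemming algorithm, and
build_stem_lookup is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def build_stem_lookup(tokens, stem):
    """Map each stem to the most frequent original word that produced it."""
    forms = defaultdict(Counter)
    for token in tokens:
        forms[stem(token)][token] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in forms.items()}


def toy_stem(word):
    """Crude stand-in stemmer for the demo; use a real stemmer in practice."""
    if word.endswith("fully"):
        return word[:-5]
    if word.endswith("ful"):
        return word[:-3]
    return word


words = ["beautiful", "beautiful", "beautifully", "cluster"]
lookup = build_stem_lookup(words, toy_stem)
readable = lookup["beauti"]
```

Here the stem "beauti" maps back to "beautiful", the form seen most
often, so cluster labels stay readable while the vectors still use the
reduced feature set.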

I tried fuzzy K-Means for soft clustering, but I didn't get good
results. Maybe the corpus was the issue.

One observation about the clustering process is that it is geared, by
accident or by design, towards batch processing. There is no
support for real-time clustering. There needs to be glue that ties
all the components together to make the process seamless. I suppose
someone in need of this feature will contribute it to Mahout.

Grant, if I recall more, I will mail it to you.

--shashi

