Forwarding this to dev.

---------- Forwarded message ----------
From: Frank Scholten <fr...@frankscholten.nl>
Date: Tue, Nov 8, 2011 at 11:56 PM
Subject: Cluster labeling
To: u...@mahout.apache.org

Hi all,

Sometimes my cluster labels are terms that hardly occur in the combined text of the documents of a cluster. I would expect the label to be a term that occurs very frequently across the documents of the cluster.

For example, suppose there is a cluster of tweets about Mahout. You would see a lot of occurrences of 'Apache Mahout' in every document. Maybe a few documents have the term 'License' in them. You could end up with a 'License' label instead of 'Apache Mahout'.

I think this happens because Mahout sorts the cluster centroid by TF-IDF weight in descending order and fetches the corresponding terms. So the 'License' label will be chosen because it has a high TF-IDF weight even though it has a low cluster frequency.

Thoughts?

Cheers,

Frank
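To make the effect concrete, here is a minimal sketch in plain Python. It is not Mahout's code: the toy corpus, the unsmoothed tf * log(N/df) weighting, and the names `tfidf`, `centroid`, and `cluster_df` are all assumptions made for illustration. It builds a cluster centroid as the mean of the TF-IDF vectors and compares the top centroid term with the terms that occur in the most documents of the cluster.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not real data). Every document in the cluster
# mentions "apache mahout"; only one of them mentions "license".
cluster = [
    "apache mahout clustering",
    "apache mahout release",
    "apache mahout license license license",
    "apache mahout kmeans",
]
other_docs = [
    "apache mahout hadoop",
    "apache hadoop mapreduce",
    "apache mahout talk",
    "hadoop mapreduce tutorial",
]
corpus = cluster + other_docs
n_docs = len(corpus)

# Document frequency over the whole corpus.
df = Counter(term for doc in corpus for term in set(doc.split()))


def tfidf(doc):
    """Assumed weighting: raw term count times log(N / df), no smoothing."""
    tf = Counter(doc.split())
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}


# Cluster centroid: mean of the TF-IDF vectors of the cluster's documents.
centroid = Counter()
for doc in cluster:
    for term, weight in tfidf(doc).items():
        centroid[term] += weight / len(cluster)

# Label chosen by sorting the centroid by TF-IDF weight, descending.
label_by_tfidf = max(centroid, key=centroid.get)

# Alternative: rank terms by how many documents of the cluster contain them
# (ties broken alphabetically so the result is deterministic).
cluster_df = Counter(term for doc in cluster for term in set(doc.split()))
top_by_cluster_df = sorted(cluster_df, key=lambda t: (-cluster_df[t], t))[:2]

print(label_by_tfidf)     # license
print(top_by_cluster_df)  # ['apache', 'mahout']
```

In this toy setup 'license' wins on centroid weight because it is rare in the corpus and repeated within a single document, while 'apache' and 'mahout' appear in all four cluster documents but carry a low IDF. Whether the same thing happens in a real run depends on the weighting and normalization actually used.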