Yes, it is possible, though this aspect of the algorithms is not exactly pluggable. You would need to introduce a new subclass of Cluster (the class, not the interface) and override the computeParameters and computeCentroid methods. I think KMeans would probably be fooled into using it if you provided instances of your new Cluster as the prior (-ci) argument.
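To make that concrete, a subclass could look roughly like the sketch below. This is untested; the constructor signature, the getS0()/getS1() running-sum accessors, and setCenter() are my assumptions about the current AbstractCluster API, so verify them against your Mahout version.

import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

// Untested sketch: a Cluster whose centroid is the running mean shrunk
// toward the origin. Constructor args and getS0()/getS1() are assumed
// from the 0.5-era AbstractCluster API.
public class ShrunkenCluster extends Cluster {

  public ShrunkenCluster(Vector center, int clusterId, DistanceMeasure measure) {
    super(center, clusterId, measure);
  }

  @Override
  public Vector computeCentroid() {
    Vector mean = getS1().divide(getS0()); // the stock centroid (assumed)
    return mean.times(0.9);                // toy alternative statistic
  }

  @Override
  public void computeParameters() {
    if (getS0() == 0) {
      return;                      // no points were observed this pass
    }
    Vector c = computeCentroid();  // read the sums before super() clears them
    super.computeParameters();     // keep the usual bookkeeping
    setCenter(c);                  // then install our centroid
  }
}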
The ClusterClassifier/Iterator is another clustering implementation which is much more pluggable, though it currently does not have a Hadoop implementation. Writing a driver, mapper & reducer for this ought to be straightforward (a rough skeleton is appended at the end of this message).

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Friday, July 22, 2011 1:23 AM
To: Brett Wines
Cc: Mahout Dev List
Subject: Re: Two quick Mahout questions

Copying dev list with Brett's permission.

On Thu, Jul 21, 2011 at 8:00 PM, Brett Wines <[email protected]> wrote:

> I'm writing in response to a question on a Mahout forum
> <http://search.lucidimagination.com/search/document/943a194edea159fc/string_clustering_and_other_newbie_questions>;
> I was wondering if you could answer a question or two for me?

Sure.

> First, do you know if there's a good way to plug in one's own
> centroid-computing function for Mahout algorithms like k-means or EM?

Actually, I am not entirely sure. I think that there is.

It is definitely true that there is a good way to plug in a new distance function for computing cluster membership (see the DistanceMeasure sketch appended below).

Jeff, is it easy to plug in a new centroid function? I think that you said yes to this as part of the classification/clustering unification work.

> Second, do you know if there's any way at all to run Mahout clustering
> algorithms on things where the features aren't numbers? The vectors don't
> support anything except doubles, and it's hacky and messy to map
> non-numerical feature data to arbitrary numbers and then undo the mapping
> in a custom distance function (the DistanceMeasure interface requires the
> comparison function to take Vectors as parameters); there's got to be a
> better solution.

Hmm... I am not clear on all of your requirements, but there are at least two methods for doing this.

One method that is commonly used is the classic vector-space conversion of text-like data. With this, there is one dimension in the feature vector per unique word (see the dictionary sketch below). There is wide support for this with clustering in Mahout. This representation is also easy to reverse engineer, but it doesn't support stupendous or open-ended vocabularies.

Another method is to use hash-encoding (see the hashing sketch below). This allows combinations of continuous, text-like and word-like data to be packed into a fixed-size vector that is merely large instead of stupendous in size. This representation is nice and consistent, but it can be difficult to reverse engineer.
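Some untested sketches of the ideas above, for the archives. First, the rough shape of the missing Hadoop side of ClusterClassifier/Iterator. Everything Mahout-specific here (the ClusterClassifier package location and its classify() method) is an assumption about the in-progress API, not settled code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.clustering.ClusterClassifier; // assumed location
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ClusterClassifierMapper extends
    Mapper<WritableComparable<?>, VectorWritable, IntWritable, VectorWritable> {

  private ClusterClassifier classifier; // load the prior models in setup()

  @Override
  protected void map(WritableComparable<?> key, VectorWritable value,
      Context context) throws IOException, InterruptedException {
    Vector pdf = classifier.classify(value.get()); // assumed API
    context.write(new IntWritable(pdf.maxValueIndex()), value);
  }
}
// A reducer would fold each cluster's points back into its model and
// write the updated clusters; the driver wires these into a Job and
// iterates until the models converge.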
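Second, the distance-function plug-in Ted mentions, which clearly is supported today. Mahout already ships a ManhattanDistanceMeasure; this just shows the shape of the interface, with the Parametered boilerplate shown minimally (check the interface in your version):

import java.util.Collection;
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.parameters.Parameter;
import org.apache.mahout.math.Vector;

public class MyDistanceMeasure implements DistanceMeasure {

  @Override
  public double distance(Vector v1, Vector v2) {
    return v1.minus(v2).norm(1.0); // Manhattan distance, as an example
  }

  @Override
  public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
    return distance(centroid, v);  // ignore the precomputed norm
  }

  @Override
  public Collection<Parameter<?>> getParameters() {
    return Collections.emptyList();
  }

  @Override
  public void createParameters(String prefix, Configuration jobConf) {
  }

  @Override
  public void configure(Configuration config) {
  }
}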
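Third, a minimal sketch of the dictionary approach: one vector dimension per unique word. The class and its vocabulary handling are illustrative only; Mahout's seq2sparse tooling is the production version of this, and inverting the dictionary map is what makes the representation easy to reverse engineer:

import java.util.HashMap;
import java.util.Map;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class DictionaryEncoder {

  private final Map<String, Integer> dictionary = new HashMap<String, Integer>();

  // cardinality must be an upper bound on the vocabulary size
  public Vector encode(String[] tokens, int cardinality) {
    Vector v = new RandomAccessSparseVector(cardinality);
    for (String token : tokens) {
      Integer index = dictionary.get(token);
      if (index == null) {
        index = dictionary.size();      // first sighting grows the vocabulary
        dictionary.put(token, index);
      }
      v.set(index, v.get(index) + 1.0); // raw term count; TF-IDF is also common
    }
    return v;
  }
}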
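Finally, a minimal sketch of hash-encoding into a fixed-size vector. This is just the generic idea; Mahout's FeatureVectorEncoder classes implement it properly, with multiple probes to soften collisions (those collisions are also why this representation is hard to reverse engineer):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class HashEncoder {

  private final int numFeatures; // e.g. 1 << 16: large, not stupendous

  public HashEncoder(int numFeatures) {
    this.numFeatures = numFeatures;
  }

  public Vector encode(String[] tokens) {
    Vector v = new RandomAccessSparseVector(numFeatures);
    for (String token : tokens) {
      int slot = (token.hashCode() & Integer.MAX_VALUE) % numFeatures;
      v.set(slot, v.get(slot) + 1.0); // colliding features simply add together
    }
    return v;
  }
}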
