Yes, it is possible, though this aspect of the algorithms is not exactly pluggable. You would need to introduce a new subclass of Cluster (the class, not the interface) and override the computeParameters and computeCentroid methods. I think KMeans would probably be fooled into using it if you provided instances of your new Cluster as the prior (-ci) argument.
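To make that concrete, a subclass could look roughly like the sketch below. This is untested; the constructor signature, the getS0()/getS1() running-sum accessors, and setCenter() are my assumptions about the current AbstractCluster API, so verify them against your Mahout version.

import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

// Untested sketch: a Cluster whose centroid is the running mean shrunk
// toward the origin. Constructor args and getS0()/getS1() are assumed
// from the 0.5-era AbstractCluster API.
public class ShrunkenCluster extends Cluster {

  public ShrunkenCluster(Vector center, int clusterId, DistanceMeasure measure) {
    super(center, clusterId, measure);
  }

  @Override
  public Vector computeCentroid() {
    Vector mean = getS1().divide(getS0()); // the stock centroid (assumed)
    return mean.times(0.9);                // toy alternative statistic
  }

  @Override
  public void computeParameters() {
    if (getS0() == 0) {
      return;                      // no points were observed this pass
    }
    Vector c = computeCentroid();  // read the sums before super() clears them
    super.computeParameters();     // keep the usual bookkeeping
    setCenter(c);                  // then install our centroid
  }
}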
The ClusterClassifier/Iterator is another clustering implementation which is much more pluggable, though it currently does not have a Hadoop implementation. Writing a driver, mapper & reducer for this ought to be straightforward (a rough skeleton is appended at the end of this message).

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Friday, July 22, 2011 1:23 AM
To: Brett Wines
Cc: Mahout Dev List
Subject: Re: Two quick Mahout questions

Copying dev list with Brett's permission.

On Thu, Jul 21, 2011 at 8:00 PM, Brett Wines <[email protected]> wrote:

> I'm writing in response to a question on a Mahout forum
> <http://search.lucidimagination.com/search/document/943a194edea159fc/string_clustering_and_other_newbie_questions>;
> I was wondering if you could answer a question or two for me?

Sure.

> First, do you know if there's a good way to plug in one's own
> centroid-computing function for Mahout algorithms like k-means or EM?

Actually, I am not entirely sure. I think that there is.

It is definitely true that there is a good way to plug in a new distance function for computing cluster membership (see the DistanceMeasure sketch appended below).

Jeff, is it easy to plug in a new centroid function? I think that you said yes to this as part of the classification/clustering unification work.

> Second, do you know if there's any way at all to run Mahout clustering
> algorithms on things where the features aren't numbers? The vectors don't
> support anything except doubles, and it's hacky and messy to map
> non-numerical feature data to arbitrary numbers and then undo the mapping
> in a custom distance function (the DistanceMeasure interface requires the
> comparison function to take Vectors as parameters); there's got to be a
> better solution.

Hmm... I am not clear on all of your requirements, but there are at least two methods for doing this.

One method that is commonly used is the classic vector-space conversion of text-like data. With this, there is one dimension in the feature vector per unique word (see the dictionary sketch below). There is wide support for this with clustering in Mahout. This representation is also easy to reverse engineer, but it doesn't support stupendous or open-ended vocabularies.

Another method is to use hash-encoding (see the hashing sketch below). This allows combinations of continuous, text-like and word-like data to be packed into a fixed-size vector that is merely large instead of stupendous in size. This representation is nice and consistent, but it can be difficult to reverse engineer.
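Some untested sketches of the ideas above, for the archives. First, the rough shape of the missing Hadoop side of ClusterClassifier/Iterator. Everything Mahout-specific here (the ClusterClassifier package location and its classify() method) is an assumption about the in-progress API, not settled code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.clustering.ClusterClassifier; // assumed location
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ClusterClassifierMapper extends
    Mapper<WritableComparable<?>, VectorWritable, IntWritable, VectorWritable> {

  private ClusterClassifier classifier; // load the prior models in setup()

  @Override
  protected void map(WritableComparable<?> key, VectorWritable value,
      Context context) throws IOException, InterruptedException {
    Vector pdf = classifier.classify(value.get()); // assumed API
    context.write(new IntWritable(pdf.maxValueIndex()), value);
  }
}
// A reducer would fold each cluster's points back into its model and
// write the updated clusters; the driver wires these into a Job and
// iterates until the models converge.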
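Second, the distance-function plug-in Ted mentions, which clearly is supported today. Mahout already ships a ManhattanDistanceMeasure; this just shows the shape of the interface, with the Parametered boilerplate shown minimally (check the interface in your version):

import java.util.Collection;
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.parameters.Parameter;
import org.apache.mahout.math.Vector;

public class MyDistanceMeasure implements DistanceMeasure {

  @Override
  public double distance(Vector v1, Vector v2) {
    return v1.minus(v2).norm(1.0); // Manhattan distance, as an example
  }

  @Override
  public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
    return distance(centroid, v);  // ignore the precomputed norm
  }

  @Override
  public Collection<Parameter<?>> getParameters() {
    return Collections.emptyList();
  }

  @Override
  public void createParameters(String prefix, Configuration jobConf) {
  }

  @Override
  public void configure(Configuration config) {
  }
}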
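Third, a minimal sketch of the dictionary approach: one vector dimension per unique word. The class and its vocabulary handling are illustrative only; Mahout's seq2sparse tooling is the production version of this, and inverting the dictionary map is what makes the representation easy to reverse engineer:

import java.util.HashMap;
import java.util.Map;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class DictionaryEncoder {

  private final Map<String, Integer> dictionary = new HashMap<String, Integer>();

  // cardinality must be an upper bound on the vocabulary size
  public Vector encode(String[] tokens, int cardinality) {
    Vector v = new RandomAccessSparseVector(cardinality);
    for (String token : tokens) {
      Integer index = dictionary.get(token);
      if (index == null) {
        index = dictionary.size();      // first sighting grows the vocabulary
        dictionary.put(token, index);
      }
      v.set(index, v.get(index) + 1.0); // raw term count; TF-IDF is also common
    }
    return v;
  }
}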
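Finally, a minimal sketch of hash-encoding into a fixed-size vector. This is just the generic idea; Mahout's FeatureVectorEncoder classes implement it properly, with multiple probes to soften collisions (those collisions are also why this representation is hard to reverse engineer):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class HashEncoder {

  private final int numFeatures; // e.g. 1 << 16: large, not stupendous

  public HashEncoder(int numFeatures) {
    this.numFeatures = numFeatures;
  }

  public Vector encode(String[] tokens) {
    Vector v = new RandomAccessSparseVector(numFeatures);
    for (String token : tokens) {
      int slot = (token.hashCode() & Integer.MAX_VALUE) % numFeatures;
      v.set(slot, v.get(slot) + 1.0); // colliding features simply add together
    }
    return v;
  }
}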
