On Sep 17, 2009, at 12:36 AM, Aleksander Stensby wrote:

Hi all,
I've been following the development of Mahout for quite a while now and
figured it was time for me to get my hands dirty:)

I've gone through the examples and Grant's excellent IBM article (great work
on that Grant!).

Thanks!

So, now I'm at the point where I want to figure out where I go next.
Specifically, I'm a bit fuzzy about common practices when it comes to
utilizing Mahout in my own applications...

Case scenario:
I have my own project, add the dependencies to Mahout (through maven), and
make my own little kMeans test class.
I guess my question is a bit stupid, but how would you go about using Mahout
out of the box?

Ideally (or maybe not?), I figured that I could just take care of providing
the Vectors -> push them into Mahout and run the kMeans clustering...
But when I started looking at the kMeans clustering example, I noticed that
there is actually a lot of implementation in the example itself... Is it
really necessary for me to implement all of those methods in every project
where I want to do kMeans? Can't they be reused? The methods I'm talking about
are for instance:
static List<Canopy> populateCanopies(DistanceMeasure measure, List<Vector>
points, double t1, double t2)

Yeah, this one is a bit weird here.

 private static void referenceKmeans(List<Vector> points,
List<List<Cluster>> clusters, DistanceMeasure measure, int maxIter)

I think that is for testing purposes, but don't have the code up at the mo'.

private static boolean iterateReference(List<Vector> points, List<Cluster>
clusters, DistanceMeasure measure)

In my narrow-minded head I would think that the input would be the List<Vector>
and that the output would be the List<List<Cluster>> of some general kMeans
method that did all the internals for me... Or am I missing something? Or do I
have to use KMeansDriver.runJob and read input from serialized vector
files?

I think the piece that is missing is that these algorithms are designed to scale and to use Hadoop. Imagine passing around 5+ million dense vectors with large cardinality.
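
For data that does fit in memory, the kind of "points in, clusters out" helper asked about above can be sketched in plain Java. This is only a minimal sketch and not Mahout's API: the class InMemoryKMeans and its methods are hypothetical names, points are plain double[] arrays rather than Mahout Vectors, the distance is fixed to squared Euclidean instead of a pluggable DistanceMeasure, and the initial centroids are assumed to be supplied by the caller. It returns the centroids after every iteration, loosely mirroring the List<List<Cluster>> shape mentioned in the question.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory k-means helper; NOT part of Mahout. */
final class InMemoryKMeans {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, List<double[]> centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.size(); c++) {
            double dist = squaredDistance(point, centroids.get(c));
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs Lloyd's iterations until the centroids stop moving or maxIter
     * is reached. Element 0 of the result holds the initial centroids;
     * each later element holds the centroids after one more iteration.
     */
    static List<List<double[]>> cluster(List<double[]> points,
                                        List<double[]> initial,
                                        int maxIter) {
        List<List<double[]>> history = new ArrayList<>();
        List<double[]> current = new ArrayList<>();
        for (double[] c : initial) {
            current.add(c.clone());
        }
        history.add(current);

        for (int iter = 0; iter < maxIter; iter++) {
            int k = current.size();
            int dim = current.get(0).length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];

            // Assignment step: add each point to its nearest centroid's sum.
            for (double[] p : points) {
                int c = nearest(p, current);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }

            // Update step: move each centroid to the mean of its points.
            // A centroid with no assigned points stays where it was.
            List<double[]> next = new ArrayList<>();
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                double[] centroid = current.get(c).clone();
                if (counts[c] > 0) {
                    for (int i = 0; i < dim; i++) {
                        centroid[i] = sums[c][i] / counts[c];
                    }
                }
                if (squaredDistance(centroid, current.get(c)) > 1e-12) {
                    converged = false;
                }
                next.add(centroid);
            }
            history.add(next);
            current = next;
            if (converged) {
                break;
            }
        }
        return history;
    }
}
```

A call would look like InMemoryKMeans.cluster(points, initialCentroids, 10), with the last element of the returned list holding the final centroids. Everything here lives in a single JVM's heap, which is exactly the assumption that stops holding at the scale described above.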
