On Sep 17, 2009, at 12:36 AM, Aleksander Stensby wrote:
Hi all,
I've been following the development of Mahout for quite a while now and figured it was time for me to get my hands dirty :)
I've gone through the examples and Grant's excellent IBM article (great work on that, Grant!).
Thanks!
So, now I'm at the point where I want to figure out where to go next. Specifically, I'm a bit fuzzy about common practices when it comes to utilizing Mahout in my own applications...
Case scenario: I have my own project, add the dependencies on Mahout (through Maven), and make my own little kMeans test class. I guess my question is a bit stupid, but how would you go about using Mahout out of the box?
Ideally (or maybe not?), I figured that I could just take care of providing the Vectors -> push them into Mahout and run the kMeans clustering... But when I started looking at the kMeans clustering example, I noticed that there is actually a lot of implementation in the example itself... Is it really necessary for me to implement all of those methods in every project where I want to do kMeans? Can't they be reused? The methods I'm talking about are, for instance:
static List<Canopy> populateCanopies(DistanceMeasure measure, List<Vector> points, double t1, double t2)
Yeah, this one is a bit weird here.
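For what it's worth, the canopy-seeding idea behind that method is simple enough to sketch in plain Java. This is only an illustration of the technique, not Mahout's implementation: I'm using bare double[] points and Euclidean distance as stand-ins for Mahout's Vector, Canopy, and DistanceMeasure types.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

    // A canopy is just a center point plus the points loosely assigned to it.
    static class Canopy {
        final double[] center;
        final List<double[]> points = new ArrayList<>();
        Canopy(double[] center) {
            this.center = center;
            this.points.add(center);
        }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // t1 > t2. Points within t1 of a center join its canopy (possibly several
    // canopies); points within t2 are bound tightly and removed from the
    // candidate pool, so they can never become centers themselves.
    static List<Canopy> populateCanopies(List<double[]> points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        List<double[]> candidates = new ArrayList<>(points);
        while (!candidates.isEmpty()) {
            double[] center = candidates.remove(0);
            Canopy canopy = new Canopy(center);
            canopies.add(canopy);
            candidates.removeIf(p -> {
                double d = distance(center, p);
                if (d < t1) {
                    canopy.points.add(p);
                }
                return d < t2;
            });
        }
        return canopies;
    }
}
```

The canopies produced this way are typically used only as cheap initial centers for kMeans, which is why the example computes them before running the actual clustering.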
private static void referenceKmeans(List<Vector> points, List<List<Cluster>> clusters, DistanceMeasure measure, int maxIter)
I think that is for testing purposes, but I don't have the code up at the mo'.
private static boolean iterateReference(List<Vector> points, List<Cluster> clusters, DistanceMeasure measure)
In my narrow-minded head I would think that the input would be the List<Vector> and that the output would be a List<List<Cluster>> from some general kMeans method that did all the internals for me... Or am I missing something? Or do I have to use KMeansDriver.runJob and read input from serialized vector files?
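The in-memory "general kMeans method" you're picturing is what the reference methods in the example roughly boil down to, and it's easy to sketch in plain Java. Again this is only a sketch of the plain Lloyd-iteration technique under my own assumptions (double[] points, Euclidean distance, caller-supplied initial centers), not Mahout's API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KMeansSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Runs at most maxIter Lloyd iterations starting from the given initial
    // centers and returns the final centers.
    static List<double[]> kmeans(List<double[]> points, List<double[]> initial, int maxIter) {
        List<double[]> centers = new ArrayList<>();
        for (double[] c : initial) {
            centers.add(c.clone());
        }
        for (int iter = 0; iter < maxIter; iter++) {
            int k = centers.size();
            int dim = centers.get(0).length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: each point contributes to its nearest center.
            for (double[] p : points) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(p, centers.get(c)) < distance(p, centers.get(best))) {
                        best = c;
                    }
                }
                counts[best]++;
                for (int d = 0; d < dim; d++) {
                    sums[best][d] += p[d];
                }
            }
            // Update step: move each center to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave empty clusters where they are
                }
                double[] mean = new double[dim];
                for (int d = 0; d < dim; d++) {
                    mean[d] = sums[c][d] / counts[c];
                }
                if (!Arrays.equals(mean, centers.get(c))) {
                    converged = false;
                }
                centers.set(c, mean);
            }
            if (converged) {
                break;
            }
        }
        return centers;
    }
}
```

Whether anything this simple is appropriate depends on whether the data fits in memory, which is exactly where the reply below comes in.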
I think the piece that is missing is that these algorithms are designed to scale and use Hadoop. Imagine passing around 5+ million dense vectors with large cardinality.