On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:

Hi Aleksander,

I've also been learning how to run mahout's clustering and LDA on our
cluster.

For k-means, the following series of steps has worked for me:

* build mahout from trunk

* write a program to convert your data to mahout Vectors. You can base
this on one of the Drivers in the mahout.utils.vectors package (which
seem designed to work locally). For bigger datasets you'll probably
need to write a simple map reduce job, more like
mahout.clustering.syntheticcontrol.canopy.InputDriver. In either event
your Vectors need to end up on the dfs.

Yeah, they are designed for local use so far, but we should work to extend them. I think this will become less of a problem as Mahout matures. Ultimately, I'd like to see utilities that simply ingest whatever is up on HDFS (office docs, PDFs, mail, etc.) and just work, but that is a _long_ way off, unless someone wants to help drive that.

Those kinds of utilities would be a great first contribution for someone looking to get started. As I see it, we could leverage Apache Tika in an M/R job to produce the appropriate input for our various algorithms.


* run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
something like:
  hadoop jar mahout-core-0.2-SNAPSHOT.job \
    org.apache.mahout.clustering.kmeans.KMeansDriver \
    -i /dfs/input/data/dir -c /dfs/initial/rand/centroids/dir \
    -o /dfs/output/dir -k <numClusters> -x <maxIters>

* possibly fix the problem described here
http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
(solution is at the bottom of the page)

* get all the output files locally

* convert the output to text format with
org.apache.mahout.utils.clustering.ClusterDumper. It might be nicer to
do this on the cluster, but the code seems to expect local files.  If
you set the name field in your input Vectors in the conversion step to a
suitable ID, then the final output can be a set of cluster centroids,
each followed by the list of Vector IDs in the corresponding cluster.
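
The run and fetch steps above can be sketched as a small shell script. This
is only a sketch: the jar name, driver class, and the -i/-c/-o/-k/-x flags are
taken from the command above, while the function names, the local directory
name, and the values 10 and 20 are placeholders I made up. It assumes paths
without spaces and that "hadoop fs -get" copies the output directory
recursively. By default it only prints the commands; set RUN=1 to execute
them.

```shell
#!/bin/sh
# Sketch only: prints (and optionally runs) the k-means and fetch commands.
# Function names, the local directory and the numeric values are placeholders.

JOB_JAR="${JOB_JAR:-mahout-core-0.2-SNAPSHOT.job}"
DRIVER="org.apache.mahout.clustering.kmeans.KMeansDriver"

# Build the KMeansDriver command line.
# Args: input dir, initial centroids dir, output dir, numClusters, maxIters
kmeans_cmd() {
    echo "hadoop jar $JOB_JAR $DRIVER" \
         "-i $1 -c $2 -o $3" \
         "-k $4 -x $5"
}

# Build the command that copies the output from the dfs to local disk.
# Args: dfs output dir, local destination dir
fetch_cmd() {
    echo "hadoop fs -get $1 $2"
}

run_cmd=$(kmeans_cmd /dfs/input/data/dir /dfs/initial/rand/centroids/dir \
                     /dfs/output/dir 10 20)
get_cmd=$(fetch_cmd /dfs/output/dir ./kmeans-output)

# Dry run: show what would be executed.
echo "$run_cmd"
echo "$get_cmd"

# Execute only when explicitly requested; stop if clustering fails.
if [ "${RUN:-0}" = "1" ]; then
    $run_cmd && $get_cmd
fi
```

Keeping the command construction in functions makes it easy to check the
exact command line before submitting a job to the cluster.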

Hope this is useful.

More importantly, if anything here is very wrong then please can a
mahout person correct me!

Looks good to me.  Suggestions/patches are welcome!
