Re: Some basic introductory questions

Grant Ingersoll Thu, 17 Sep 2009 06:36:34 -0700


On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:

Hi Aleksander,

I've also been learning how to run mahout's clustering and LDA on our
cluster.

For k-means, the following series of steps has worked for me:

* build mahout from trunk

* write a program to convert your data to mahout Vectors. You can base
this on one of the Drivers in the mahout.utils.vectors package (which
seem designed to work locally).  For bigger datasets you'll probably
need to write a simple map reduce job, more like
mahout.clustering.syntheticcontrol.canopy.InputDriver. In either event
your Vectors need to end up on the dfs.

Yeah, they are designed for local so far, but we should work to extend them. I think as Mahout matures, this will become less and less of a problem. Ultimately, I'd like to see utilities that simply ingest whatever is up on HDFS (office docs, PDFs, mail, etc.) and just work, but that is a _long_ way off, unless someone wants to help drive that.

Those kinds of utilities would be great contributions from someone looking to get started contributing. As I see it, we could leverage Apache Tika with a M/R job to produce the appropriate kinds of input for our various algorithms.
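
For the conversion step discussed above, here is a rough sketch of the parsing half only, in plain Java. It assumes each input line looks like "id,v1,v2,..." (that layout is an assumption, not something from this thread), and NamedPoint is a hypothetical helper class, not a Mahout type. A real converter would still need to copy the values into a Mahout Vector, set its name field to the ID, and write the result to the dfs, none of which is shown here.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Mahout: an ID plus dense feature values. */
public class NamedPoint {
    final String name;      // ID that should survive into the clustering output
    final double[] values;  // dense feature values

    NamedPoint(String name, double[] values) {
        this.name = name;
        this.values = values;
    }

    /** Parses one line of the assumed form "id,v1,v2,...". */
    static NamedPoint parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException(
                "expected an id plus at least one value: " + line);
        }
        double[] values = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            values[i - 1] = Double.parseDouble(fields[i].trim());
        }
        return new NamedPoint(fields[0].trim(), values);
    }

    public static void main(String[] args) {
        NamedPoint p = parse("doc-17, 0.5, 2.0, 1.25");
        System.out.println(p.name + " -> " + Arrays.toString(p.values));
    }
}
```

Keeping the ID next to the values from the start is what later lets the clustering output be traced back to the original records.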

* run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
something like:
  hadoop jar mahout-core-0.2-SNAPSHOT.job \
    org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir \
    -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters> \
    -x <maxIters>
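
To make the roles of the initial centroids and the iteration cap in the command above more concrete, here is a small single-machine sketch of textbook k-means in plain Java. It is not Mahout's implementation: KMeansSketch and its methods are hypothetical names, squared Euclidean distance is an assumption, and reading -k as the number of centroids and -x as the maximum number of rounds is inferred only from the placeholder names in the command.

```java
import java.util.Arrays;

/** Plain-Java illustration of k-means; not Mahout's implementation. */
public class KMeansSketch {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c])
                    < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs at most maxIters rounds, updating the centroids in place, and
     * returns the cluster index from the last assignment step.
     */
    static int[] cluster(double[][] points, double[][] centroids, int maxIters) {
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIters; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (iter == 0 || c != assignment[p]) {
                    changed = true;
                }
                assignment[p] = c;
            }
            if (!changed) {
                break; // no point moved, so the centroids would not move either
            }
            // Update step: move each centroid to the mean of its points.
            int dim = points[0].length;
            double[][] sums = new double[centroids.length][dim];
            int[] counts = new int[centroids.length];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // leave an empty cluster's centroid where it is
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {
            {1.0, 1.0}, {1.5, 2.0}, {0.5, 1.5},
            {8.0, 8.0}, {9.0, 9.5}, {8.5, 9.0}
        };
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}}; // k = 2 initial guesses
        int[] assignment = cluster(points, centroids, 10);
        System.out.println(Arrays.toString(assignment));
        System.out.println(Arrays.deepToString(centroids));
    }
}
```

The distributed job presumably spreads the assignment and update steps over map and reduce tasks, and its distance measure and convergence test may differ from this sketch.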

* possibly fix the problem described here
http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
(solution is at the bottom of the page)

* get all the output files locally

* convert the output to text format with
org.apache.mahout.utils.clustering.ClusterDumper. It might be nicer to
do this on the cluster, but the code seems to expect local files.  If
you set the name field in your input Vectors in the conversion step to a
suitable ID, then the final output can be a set of cluster centroids,
each followed by the list of Vector IDs in the corresponding cluster.

Hope this is useful.

More importantly, if anything here is very wrong then please can a
mahout person correct me!


Looks good to me.  Suggestions/patches are welcome!
