On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
Hi Aleksander,
I've also been learning how to run mahout's clustering and LDA on our
cluster.
For k-means, the following series of steps has worked for me:
* build mahout from trunk
* write a program to convert your data to mahout Vectors. You can base
this on one of the Drivers in the mahout.utils.vectors package (which
seem designed to work locally). For bigger datasets you'll probably
need to write a simple map reduce job, more like
mahout.clustering.syntheticcontrol.canopy.InputDriver. In either event
your Vectors need to end up on the dfs.
Yeah, they are designed for local use so far, but we should work to
extend them. I think as Mahout matures, this will become less and less
of a problem. Ultimately, I'd like to see utilities that simply ingest
whatever is up on HDFS (office docs, PDFs, mail, etc.) and just work,
but that is a _long_ way off, unless someone wants to help drive that.
Those kinds of utilities would be great contributions from someone
looking to get started contributing. As I see it, we could leverage
Apache Tika with a M/R job to produce the appropriate kinds of things
for our various algorithms.
* run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
something like:
hadoop jar mahout-core-0.2-SNAPSHOT.job \
  org.apache.mahout.clustering.kmeans.KMeansDriver \
  -i /dfs/input/data/dir \
  -c /dfs/initial/rand/centroids/dir \
  -o /dfs/output/dir \
  -k <numClusters> -x <maxIters>
* possibly fix the problem described here:
http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
(solution is at the bottom of the page)
* get all the output files locally
* convert the output to text format with
org.apache.mahout.utils.clustering.ClusterDumper. It might be nicer to
do this on the cluster, but the code seems to expect local files. If
you set the name field in your input Vectors in the conversion step to
a suitable ID, then the final output can be a set of cluster centroids,
each followed by the list of Vector IDs in the corresponding cluster.
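As a rough sketch, the run and fetch steps above could be scripted along these lines. The jar name, driver class, flags and dfs paths are the ones given above. The variable names, the local directory, the DRY_RUN switch and the example values for -k and -x are my own assumptions for illustration, and the ClusterDumper step is left out because I am not sure of its options.

```shell
#!/bin/sh
# Sketch only. Variable names, LOCAL_DIR, DRY_RUN and the values of K and
# MAX_ITERS are assumptions; the jar, class, flags and dfs paths come from
# the steps above.
JOB=mahout-core-0.2-SNAPSHOT.job
DRIVER=org.apache.mahout.clustering.kmeans.KMeansDriver
INPUT=/dfs/input/data/dir
CENTROIDS=/dfs/initial/rand/centroids/dir
OUTPUT=/dfs/output/dir
LOCAL_DIR=./kmeans-output   # hypothetical local destination
K=10                        # example value for <numClusters>
MAX_ITERS=20                # example value for <maxIters>

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: run k-means on the cluster.
run_kmeans() {
    run hadoop jar "$JOB" "$DRIVER" -i "$INPUT" -c "$CENTROIDS" \
        -o "$OUTPUT" -k "$K" -x "$MAX_ITERS"
}

# Step 2: copy the output directory from the dfs to local disk.
fetch_output() {
    run hadoop fs -get "$OUTPUT" "$LOCAL_DIR"
}

run_kmeans && fetch_output
# Step 3 (not shown): run ClusterDumper on the local copy. Its options
# depend on the Mahout version, so check its source for your build.
```

With DRY_RUN left at its default the script only prints the two commands, which is a cheap way to check the paths before submitting anything; set DRY_RUN=0 to actually run them.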
Hope this is useful.
More importantly, if anything here is very wrong then please can a
mahout person correct me!
Looks good to me. Suggestions/patches are welcome!