Thanks for all the replies, guys! I understand the flow of things now and it makes sense, but as Shawn pointed out there could still be more abstraction (and once I get my hands dirty I'll do my best to contribute here as well :) )
And to Levy: your proposed flow of things makes sense, but what I wanted was to do all of that from one entry point. Ideally, I don't want to do anything manually here; I want everything (whatever the algorithm) to run on a regular basis from a single entry point. And I can probably manage that just fine by using the Drivers. A couple of rough code sketches for the conversion and output-inspection steps follow at the bottom of this mail.

Again, thanks for the replies!

Cheers,
 Aleks

On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:

>
> On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
>
>  Hi Aleksander,
>>
>> I've also been learning how to run Mahout's clustering and LDA on our
>> cluster.
>>
>> For k-means, the following series of steps has worked for me:
>>
>> * Build Mahout from trunk.
>>
>> * Write a program to convert your data to Mahout Vectors. You can base
>> this on one of the Drivers in the mahout.utils.vectors package (which
>> seem designed to work locally). For bigger datasets you'll probably need
>> to write a simple map-reduce job, more like
>> mahout.clustering.syntheticcontrol.canopy.InputDriver. In either case
>> your Vectors need to end up on the dfs.
>>
>
> Yeah, they are designed for local use so far, but we should work to
> extend them. I think as Mahout matures, this problem will become less and
> less of an issue. Ultimately, I'd like to see utilities that simply ingest
> whatever is up on HDFS (office docs, PDFs, mail, etc.) and just work, but
> that is a _long_ way off, unless someone wants to help drive it.
>
> Those kinds of utilities would be great contributions from someone
> looking to get started. As I see it, we could leverage Apache Tika with an
> M/R job to produce the appropriate kinds of input for our various
> algorithms.
>
>
>> * Run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
>> something like:
>>
>>   hadoop jar mahout-core-0.2-SNAPSHOT.job \
>>     org.apache.mahout.clustering.kmeans.KMeansDriver \
>>     -i /dfs/input/data/dir -c /dfs/initial/rand/centroids/dir \
>>     -o /dfs/output/dir -k <numClusters> -x <maxIters>
>>
>> * Possibly fix the problem described at
>> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
>> (the solution is at the bottom of the page).
>>
>> * Get all the output files locally.
>>
>> * Convert the output to text format with
>> org.apache.mahout.utils.clustering.ClusterDumper. It might be nicer to
>> do this on the cluster, but the code seems to expect local files. If you
>> set the name field of your input Vectors to a suitable ID in the
>> conversion step, then the final output can be a set of cluster centroids,
>> each followed by the list of Vector IDs in the corresponding cluster.
>>
>> Hope this is useful.
>>
>> More importantly, if anything here is very wrong then can a Mahout
>> person please correct me!
>>
>
> Looks good to me. Suggestions/patches are welcome!
>

--
Aleksander M. Stensby
Lead Software Developer and System Architect
Integrasco A/S
E-mail: [email protected]
Tel.: +47 41 22 82 72
www.integrasco.com
http://twitter.com/Integrasco
http://facebook.com/Integrasco

Please consider the environment before printing all or any of this e-mail
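Following up on Mark's conversion step: here is a minimal local sketch of what such a program can look like. It assumes the 0.2-era org.apache.mahout.matrix.SparseVector (which, if I remember right, is itself Writable and carries the name field Mark mentions) and a made-up output path matching the -i argument above; the vector classes have been moving around on trunk, so treat the class and constructor names as assumptions and check them against your checkout.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.matrix.SparseVector;

public class PointsWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical output path, matching the -i argument to KMeansDriver.
    Path out = new Path("/dfs/input/data/dir/part-00000");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, SparseVector.class);
    try {
      // Toy data; in practice you'd read your own records here.
      double[][] points = { { 1.0, 1.0 }, { 8.0, 8.0 } };
      for (int i = 0; i < points.length; i++) {
        SparseVector vec = new SparseVector(points[i].length);
        for (int j = 0; j < points[i].length; j++) {
          vec.set(j, points[i][j]);
        }
        // The name field Mark mentions -- ClusterDumper can report these IDs.
        vec.setName("point-" + i);
        writer.append(new Text(vec.getName()), vec);
      }
    } finally {
      writer.close();
    }
  }
}

If you run it with your cluster's Hadoop config on the classpath, FileSystem.get(conf) should resolve to the dfs rather than the local filesystem, so the vectors end up where KMeansDriver expects them.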

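And for the last two steps, once the output files are local, a version-agnostic way to peek at them before (or instead of) running ClusterDumper is to read the SequenceFiles generically, letting each file declare its own key/value classes rather than hard-coding Mahout's cluster types:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFilePeek {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Local filesystem, assuming you've already done a 'hadoop fs -get'.
    FileSystem fs = FileSystem.getLocal(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      // Instantiate whatever key/value classes the file was written with.
      Writable key =
          (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value =
          (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}

This only prints something useful when the value class has a sensible toString(), but it's handy for sanity-checking what a job actually wrote.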