Aleksander,

As a (temporarily) naive user of the system, you are in a special position to answer a few use-case questions. Because I think we need to collect some of these impressions, I have created a simple form with fewer than a dozen questions about intended use and the preferred shape of the software.
Could you go to the URL below to answer those questions?

http://spreadsheets.google.com/viewform?formkey=dGdZMXNSLVBwWXhuX2E0cmVfNmJ3R1E6MA

On Thu, Sep 17, 2009 at 11:59 AM, Aleksander Stensby <[email protected]> wrote:

> Thanks for all the replies, guys!
> I understand the flow of things and it makes sense, but like Shawn pointed
> out, there could still be more abstraction (and once I get my hands dirty
> I'll try to do my best to contribute here as well :) )
>
> And to Levy: your proposed flow of things makes sense, but what I wanted
> was to do all that from one entry point. (Ideally, I don't want to do
> manual stuff here; I want everything to be able to run on a regular basis
> from a single entry point - and by that I mean any algorithm, etc.) And I
> can probably do that just fine by using the Drivers etc.
>
> Again, thanks for the replies!
>
> Cheers,
> Aleks
>
> On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:
>
> > On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
> >
> >> Hi Aleksander,
> >>
> >> I've also been learning how to run Mahout's clustering and LDA on our
> >> cluster.
> >>
> >> For k-means, the following series of steps has worked for me:
> >>
> >> * build Mahout from trunk
> >>
> >> * write a program to convert your data to Mahout Vectors. You can base
> >> this on one of the Drivers in the mahout.utils.vectors package (which
> >> seem designed to work locally). For bigger datasets you'll probably
> >> need to write a simple map-reduce job, more like
> >> mahout.clustering.syntheticcontrol.canopy.InputDriver. In either event,
> >> your Vectors need to end up on the DFS.
> >
> > Yeah, they are designed for local use so far, but we should work to
> > extend them. I think as Mahout matures, this problem will become less
> > and less common. Ultimately, I'd like to see utilities that simply
> > ingest whatever is up on HDFS (office docs, PDFs, mail, etc.) and just
> > work, but that is a _long_ way off, unless someone wants to help drive
> > that.
> >
> > Those kinds of utilities would be great contributions from someone
> > looking to get started contributing. As I see it, we could leverage
> > Apache Tika with an M/R job to produce the appropriate kinds of things
> > for our various algorithms.
> >
> >> * run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
> >> something like:
> >> hadoop jar mahout-core-0.2-SNAPSHOT.job
> >> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
> >> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
> >> -x <maxIters>
> >>
> >> * possibly fix the problem described here (the solution is at the
> >> bottom of the page):
> >> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
> >>
> >> * get all the output files locally
> >>
> >> * convert the output to text format with
> >> org.apache.mahout.utils.clustering.ClusterDumper. It might be nicer to
> >> do this on the cluster, but the code seems to expect local files. If
> >> you set the name field in your input Vectors in the conversion step to
> >> a suitable ID, then the final output can be a set of cluster centroids,
> >> each followed by the list of Vector IDs in the corresponding cluster.
> >>
> >> Hope this is useful.
> >>
> >> More importantly, if anything here is very wrong then please can a
> >> Mahout person correct me!
> >
> > Looks good to me. Suggestions/patches are welcome!
>
> --
> Aleksander M. Stensby
> Lead Software Developer and System Architect
> Integrasco A/S
> E-mail: [email protected]
> Tel.: +47 41 22 82 72
> www.integrasco.com
> http://twitter.com/Integrasco
> http://facebook.com/Integrasco
>
> Please consider the environment before printing all or any of this e-mail

--
Ted Dunning, CTO
DeepDyve
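[For readers following along: the KMeansDriver invocation quoted in Mark Levy's steps can be sketched as a small shell script. The HDFS paths come from his example; the concrete -k and -x values are placeholders I chose for illustration, not settings from the thread.]

```shell
#!/bin/sh
# Sketch of the k-means run described in the thread. Paths follow the
# quoted example; NUM_CLUSTERS and MAX_ITERS are placeholder values.
INPUT=/dfs/input/data/dir              # vectorized input, already on the DFS
CENTROIDS=/dfs/initial/rand/centroids/dir
OUTPUT=/dfs/output/dir
NUM_CLUSTERS=20                        # <numClusters> in the quoted command
MAX_ITERS=10                           # <maxIters> in the quoted command

# Assemble the command exactly as quoted in Mark Levy's steps.
CMD="hadoop jar mahout-core-0.2-SNAPSHOT.job \
org.apache.mahout.clustering.kmeans.KMeansDriver \
-i $INPUT -c $CENTROIDS -o $OUTPUT -k $NUM_CLUSTERS -x $MAX_ITERS"

# Print rather than execute, since this is only a sketch and assumes a
# running Hadoop cluster with the Mahout job file on the path.
echo "$CMD"
```

Running it for real just means replacing the `echo` with the command itself once the input Vectors are on the DFS.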
