Of course, I'm happy to. You should probably add a few follow-up questions to questions like "Do you currently use or develop with Mahout?" - if I answer yes, but not in production, I may still plan on using it in production :) Same goes for the second question :)
As for the last question, "standalone batch programs with defined file-based inputs and outputs" is obviously "acceptable" to me, but ideally I would like the second and third options.
Cheers,
Aleks

On Thu, Sep 17, 2009 at 11:02 PM, Ted Dunning <[email protected]> wrote:

> Aleksander,
>
> As a (temporarily) naive user of the system, you are in a special position
> to answer a few use-case questions. Because I think that we need to collect
> some of these impressions, I have created a simple form with fewer than a
> dozen questions about intended use and the preferred shape of the software.
>
> Could you go to the URL below to answer those questions?
>
> http://spreadsheets.google.com/viewform?formkey=dGdZMXNSLVBwWXhuX2E0cmVfNmJ3R1E6MA
>
> On Thu, Sep 17, 2009 at 11:59 AM, Aleksander Stensby
> <[email protected]> wrote:
>
> > Thanks for all the replies, guys!
> > I understand the flow of things and it makes sense, but like Shawn pointed
> > out, there could still be more abstraction (and once I get my hands dirty
> > I'll try to do my best to contribute here as well :) ).
> >
> > And to Levy: your proposed flow of things makes sense, but what I wanted
> > was to do all of that from one entry point. (Ideally, I don't want to do
> > manual steps here; I want everything to be able to run on a regular basis
> > from a single entry point - and I mean that for any algorithm, etc.) And I
> > can probably do that just fine by using the Drivers etc.
> >
> > Again, thanks for the replies!
> >
> > Cheers,
> > Aleks
> >
> > On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:
> >
> > > On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
> > >
> > >> Hi Aleksander,
> > >>
> > >> I've also been learning how to run Mahout's clustering and LDA on our
> > >> cluster.
> > >>
> > >> For k-means, the following series of steps has worked for me:
> > >>
> > >> * build Mahout from trunk
> > >>
> > >> * write a program to convert your data to Mahout Vectors. You can base
> > >> this on one of the Drivers in the mahout.utils.vectors package (which
> > >> seem designed to work locally). For bigger datasets you'll probably
> > >> need to write a simple map-reduce job, more like
> > >> mahout.clustering.syntheticcontrol.canopy.InputDriver. In either event
> > >> your Vectors need to end up on the DFS.
> > >
> > > Yeah, they are designed for local use so far, but we should work to
> > > extend them. I think as Mahout matures, this problem will become less
> > > and less of an issue. Ultimately, I'd like to see utilities that simply
> > > ingest whatever is up on HDFS (office docs, PDFs, mail, etc.) and just
> > > work, but that is a _long_ way off, unless someone wants to help drive
> > > that.
> > >
> > > Those kinds of utilities would be great contributions from someone
> > > looking to get started contributing. As I see it, we could leverage
> > > Apache Tika with a M/R job to produce the appropriate kinds of inputs
> > > for our various algorithms.
> > >
> > >> * run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
> > >> something like:
> > >> hadoop jar mahout-core-0.2-SNAPSHOT.job
> > >> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
> > >> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
> > >> -x <maxIters>
> > >>
> > >> * possibly fix the problem described here:
> > >> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
> > >> (solution is at the bottom of the page)
> > >>
> > >> * get all the output files locally
> > >>
> > >> * convert the output to text format with
> > >> org.apache.mahout.utils.clustering.ClusterDumper. It might be nicer to
> > >> do this on the cluster, but the code seems to expect local files. If
> > >> you set the name field in your input Vectors in the conversion step to
> > >> a suitable ID, then the final output can be a set of cluster centroids,
> > >> each followed by the list of Vector IDs in the corresponding cluster.
> > >>
> > >> Hope this is useful.
> > >>
> > >> More importantly, if anything here is very wrong then please can a
> > >> Mahout person correct me!
> > >
> > > Looks good to me. Suggestions/patches are welcome!
> >
> > --
> > Aleksander M. Stensby
> > Lead Software Developer and System Architect
> > Integrasco A/S
> > E-mail: [email protected]
> > Tel.: +47 41 22 82 72
> > www.integrasco.com
> > http://twitter.com/Integrasco
> > http://facebook.com/Integrasco
> >
> > Please consider the environment before printing all or any of this e-mail
>
> --
> Ted Dunning, CTO
> DeepDyve

--
Aleksander M. Stensby
Lead Software Developer and System Architect
Integrasco A/S
www.integrasco.com
http://twitter.com/Integrasco
http://facebook.com/Integrasco

Please consider the environment before printing all or any of this e-mail
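For anyone wondering what Grant's "Tika plus M/R" idea might look like in practice, here is a minimal sketch of a mapper that runs Tika over raw document bytes and emits plain text for later vectorization. The input layout (a SequenceFile of <document path, raw bytes>) and the class name are assumptions made up for illustration; only the Hadoop and Tika calls themselves are standard.

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaTextExtractMapper
    extends Mapper<Text, BytesWritable, Text, Text> {

  private final AutoDetectParser parser = new AutoDetectParser();

  @Override
  protected void map(Text path, BytesWritable raw, Context context)
      throws IOException, InterruptedException {
    // BodyContentHandler collects the plain text of whatever Tika detects
    // (PDF, Office, mail, ...); -1 removes the default size limit.
    BodyContentHandler handler = new BodyContentHandler(-1);
    try {
      parser.parse(new ByteArrayInputStream(raw.getBytes(), 0, raw.getLength()),
                   handler, new Metadata());
    } catch (Exception e) {
      // Skip documents Tika cannot parse instead of failing the whole job.
      context.getCounter("tika", "parse-failures").increment(1);
      return;
    }
    // Emit <document path, extracted text>; a later step would turn this
    // text into Mahout Vectors.
    context.write(path, new Text(handler.toString()));
  }
}

A driver that wires this up with SequenceFile input and output formats is omitted here.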
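And a rough sketch of Mark's conversion step for the simple local case, including the name-field trick he mentions for making ClusterDumper's output readable. The Mahout vector API has moved around between releases, so treat the class names below (the 0.2-era org.apache.mahout.matrix.SparseVector, assumed here to implement Writable and to have a settable name) as assumptions to check against your own checkout; the whitespace-separated input format and the paths are invented for the example.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.matrix.SparseVector;

public class VectorConverter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0] = local text file, one data point per line of feature values
    // args[1] = output on the DFS, e.g. /dfs/input/data/dir/part-00000
    Path output = new Path(args[1]);

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, SparseVector.class);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    int id = 0;
    while ((line = in.readLine()) != null) {
      String[] tokens = line.split("\\s+");
      SparseVector vec = new SparseVector(tokens.length);
      for (int i = 0; i < tokens.length; i++) {
        vec.set(i, Double.parseDouble(tokens[i]));
      }
      // Setting the name field to an ID is what lets ClusterDumper list
      // Vector IDs per cluster later, as Mark notes above.
      vec.setName("doc-" + id++);
      writer.append(new Text(vec.getName()), vec);
    }
    in.close();
    writer.close();
  }
}

For anything bigger than a toy dataset you would do the same thing inside a map-reduce job, as Mark says, along the lines of the syntheticcontrol InputDriver.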
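Finally, a sketch of the "single entry point" Aleksander is after: one small program that drives the pipeline by calling the drivers' main methods directly, using the same flags as Mark's hadoop jar command above. The class name, paths, and the values for -k and -x are placeholders.

import org.apache.mahout.clustering.kmeans.KMeansDriver;

// Hypothetical single entry point; run it with
//   hadoop jar your-job.jar NightlyClusteringJob
// so the cluster configuration is picked up.
public class NightlyClusteringJob {
  public static void main(String[] args) throws Exception {
    // Step 1: assume the Vectors are already on the DFS
    // (see the conversion sketch above).

    // Step 2: run k-means with the same flags as the command line.
    KMeansDriver.main(new String[] {
        "-i", "/dfs/input/data/dir",
        "-c", "/dfs/initial/rand/centroids/dir",
        "-o", "/dfs/output/dir",
        "-k", "20",   // placeholder for <numClusters>
        "-x", "10"    // placeholder for <maxIters>
    });

    // Step 3: pull the output down and run ClusterDumper on it, as in
    // Mark's last two steps.
  }
}

A job that needs to run on a regular basis could then simply schedule this one class, which is the kind of workflow Aleksander describes.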
