Aleksander,

As a (temporarily) naive user of the system, you are in a special position to answer a few use-case questions. Because I think we need to collect some of these impressions, I have created a simple form with fewer than a dozen questions about intended use and the preferred shape of the software.
Could you go to the URL below to answer those questions?

http://spreadsheets.google.com/viewform?formkey=dGdZMXNSLVBwWXhuX2E0cmVfNmJ3R1E6MA

On Thu, Sep 17, 2009 at 11:59 AM, Aleksander Stensby <[email protected]> wrote:

> Thanks for all the replies, guys!
> I understand the flow of things and it makes sense, but like Shawn pointed
> out, there could still be more abstraction (and once I get my hands dirty
> I'll try to do my best to contribute here as well :) )
>
> And to Levy: your proposed flow of things makes sense, but what I wanted
> was to do all that from one entry point. (Ideally, I don't want to do
> manual stuff here; I want everything to be able to run on a regular basis
> from a single entry point - and by that I mean any algorithm, etc.) And I
> can probably do that just fine by using the Drivers etc.
>
> Again, thanks for the replies!
>
> Cheers,
> Aleks
>
> On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:
>
> > On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
> >
> >> Hi Aleksander,
> >>
> >> I've also been learning how to run Mahout's clustering and LDA on our
> >> cluster.
> >>
> >> For k-means, the following series of steps has worked for me:
> >>
> >> * build Mahout from trunk
> >>
> >> * write a program to convert your data to Mahout Vectors. You can base
> >> this on one of the Drivers in the mahout.utils.vectors package (which
> >> seem designed to work locally). For bigger datasets you'll probably
> >> need to write a simple map-reduce job, more like
> >> mahout.clustering.syntheticcontrol.canopy.InputDriver. In either event,
> >> your Vectors need to end up on the DFS.
> >
> > Yeah, they are designed for local use so far, but we should work to
> > extend them. I think as Mahout matures, this problem will become less
> > and less common. Ultimately, I'd like to see utilities that simply
> > ingest whatever is up on HDFS (office docs, PDFs, mail, etc.) and just
> > work, but that is a _long_ way off, unless someone wants to help drive
> > that.
> >
> > Those kinds of utilities would be great contributions from someone
> > looking to get started contributing. As I see it, we could leverage
> > Apache Tika with an M/R job to produce the appropriate kinds of things
> > for our various algorithms.
> >
> >> * run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
> >> something like:
> >> hadoop jar mahout-core-0.2-SNAPSHOT.job
> >> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
> >> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
> >> -x <maxIters>
> >>
> >> * possibly fix the problem described here (the solution is at the
> >> bottom of the page):
> >> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
> >>
> >> * get all the output files locally
> >>
> >> * convert the output to text format with
> >> org.apache.mahout.utils.clustering.ClusterDumper. It might be nicer to
> >> do this on the cluster, but the code seems to expect local files. If
> >> you set the name field in your input Vectors in the conversion step to
> >> a suitable ID, then the final output can be a set of cluster centroids,
> >> each followed by the list of Vector IDs in the corresponding cluster.
> >>
> >> Hope this is useful.
> >>
> >> More importantly, if anything here is very wrong then please can a
> >> Mahout person correct me!
> >
> > Looks good to me. Suggestions/patches are welcome!
>
> --
> Aleksander M. Stensby
> Lead Software Developer and System Architect
> Integrasco A/S
> E-mail: [email protected]
> Tel.: +47 41 22 82 72
> www.integrasco.com
> http://twitter.com/Integrasco
> http://facebook.com/Integrasco
>
> Please consider the environment before printing all or any of this e-mail

--
Ted Dunning, CTO
DeepDyve
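[For readers following along: the KMeansDriver invocation quoted in Mark Levy's steps can be sketched as a small shell script. The HDFS paths come from his example; the concrete -k and -x values are placeholders I chose for illustration, not settings from the thread.]

```shell
#!/bin/sh
# Sketch of the k-means run described in the thread. Paths follow the
# quoted example; NUM_CLUSTERS and MAX_ITERS are placeholder values.
INPUT=/dfs/input/data/dir              # vectorized input, already on the DFS
CENTROIDS=/dfs/initial/rand/centroids/dir
OUTPUT=/dfs/output/dir
NUM_CLUSTERS=20                        # <numClusters> in the quoted command
MAX_ITERS=10                           # <maxIters> in the quoted command

# Assemble the command exactly as quoted in Mark Levy's steps.
CMD="hadoop jar mahout-core-0.2-SNAPSHOT.job \
org.apache.mahout.clustering.kmeans.KMeansDriver \
-i $INPUT -c $CENTROIDS -o $OUTPUT -k $NUM_CLUSTERS -x $MAX_ITERS"

# Print rather than execute, since this is only a sketch and assumes a
# running Hadoop cluster with the Mahout job file on the path.
echo "$CMD"
```

Running it for real just means replacing the `echo` with the command itself once the input Vectors are on the DFS.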
