Of course, I'm happy to. You should probably add a few follow-up questions to questions like "Do you currently use or develop with Mahout?" - if I answer yes, but not in production, I may still plan on using it in production :) Same goes for the second question :)
As for the last question, "standalone batch programs with defined file-based inputs and outputs" is obviously "acceptable" to me, but ideally I would like the second and third options.
Cheers,
Aleks

On Thu, Sep 17, 2009 at 11:02 PM, Ted Dunning <[email protected]> wrote:

> Aleksander,
>
> As a (temporarily) naive user of the system, you are in a special position
> to answer a few use-case questions. Because I think that we need to collect
> some of these impressions, I have created a simple form with fewer than a
> dozen questions about intended use and the preferred shape of the software.
>
> Could you go to the URL below to answer those questions?
>
> http://spreadsheets.google.com/viewform?formkey=dGdZMXNSLVBwWXhuX2E0cmVfNmJ3R1E6MA
>
> On Thu, Sep 17, 2009 at 11:59 AM, Aleksander Stensby
> <[email protected]> wrote:
>
> > Thanks for all the replies, guys!
> > I understand the flow of things and it makes sense, but like Shawn pointed
> > out, there could still be more abstraction (and once I get my hands dirty
> > I'll try to do my best to contribute here as well :) ).
> >
> > And to Levy: your proposed flow of things makes sense, but what I wanted
> > was to do all of that from one entry point. (Ideally, I don't want to do
> > manual steps here; I want everything to be able to run on a regular basis
> > from a single entry point - and I mean that for any algorithm, etc.) And I
> > can probably do that just fine by using the Drivers etc.
> >
> > Again, thanks for the replies!
> >
> > Cheers,
> > Aleks
> >
> > On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <[email protected]> wrote:
> >
> > > On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
> > >
> > >> Hi Aleksander,
> > >>
> > >> I've also been learning how to run Mahout's clustering and LDA on our
> > >> cluster.
> > >>
> > >> For k-means, the following series of steps has worked for me:
> > >>
> > >> * build Mahout from trunk
> > >>
> > >> * write a program to convert your data to Mahout Vectors. You can base
> > >> this on one of the Drivers in the mahout.utils.vectors package (which
> > >> seem designed to work locally). For bigger datasets you'll probably
> > >> need to write a simple map-reduce job, more like
> > >> mahout.clustering.syntheticcontrol.canopy.InputDriver. In either event
> > >> your Vectors need to end up on the DFS.
> > >
> > > Yeah, they are designed for local use so far, but we should work to
> > > extend them. I think as Mahout matures, this problem will become less
> > > and less of an issue. Ultimately, I'd like to see utilities that simply
> > > ingest whatever is up on HDFS (office docs, PDFs, mail, etc.) and just
> > > work, but that is a _long_ way off, unless someone wants to help drive
> > > that.
> > >
> > > Those kinds of utilities would be great contributions from someone
> > > looking to get started contributing. As I see it, we could leverage
> > > Apache Tika with a M/R job to produce the appropriate kinds of inputs
> > > for our various algorithms.
> > >
> > >> * run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
> > >> something like:
> > >> hadoop jar mahout-core-0.2-SNAPSHOT.job
> > >> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
> > >> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
> > >> -x <maxIters>
> > >>
> > >> * possibly fix the problem described here:
> > >> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
> > >> (solution is at the bottom of the page)
> > >>
> > >> * get all the output files locally
> > >>
> > >> * convert the output to text format with
> > >> org.apache.mahout.utils.clustering.ClusterDumper. It might be nicer to
> > >> do this on the cluster, but the code seems to expect local files. If
> > >> you set the name field in your input Vectors in the conversion step to
> > >> a suitable ID, then the final output can be a set of cluster centroids,
> > >> each followed by the list of Vector IDs in the corresponding cluster.
> > >>
> > >> Hope this is useful.
> > >>
> > >> More importantly, if anything here is very wrong then please can a
> > >> Mahout person correct me!
> > >
> > > Looks good to me. Suggestions/patches are welcome!
> >
> > --
> > Aleksander M. Stensby
> > Lead Software Developer and System Architect
> > Integrasco A/S
> > E-mail: [email protected]
> > Tel.: +47 41 22 82 72
> > www.integrasco.com
> > http://twitter.com/Integrasco
> > http://facebook.com/Integrasco
> >
> > Please consider the environment before printing all or any of this e-mail
>
> --
> Ted Dunning, CTO
> DeepDyve

--
Aleksander M. Stensby
Lead Software Developer and System Architect
Integrasco A/S
www.integrasco.com
http://twitter.com/Integrasco
http://facebook.com/Integrasco

Please consider the environment before printing all or any of this e-mail
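For anyone wondering what Grant's "Tika plus M/R" idea might look like in practice, here is a minimal sketch of a mapper that runs Tika over raw document bytes and emits plain text for later vectorization. The input layout (a SequenceFile of <document path, raw bytes>) and the class name are assumptions made up for illustration; only the Hadoop and Tika calls themselves are standard.

import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaTextExtractMapper
    extends Mapper<Text, BytesWritable, Text, Text> {

  private final AutoDetectParser parser = new AutoDetectParser();

  @Override
  protected void map(Text path, BytesWritable raw, Context context)
      throws IOException, InterruptedException {
    // BodyContentHandler collects the plain text of whatever Tika detects
    // (PDF, Office, mail, ...); -1 removes the default size limit.
    BodyContentHandler handler = new BodyContentHandler(-1);
    try {
      parser.parse(new ByteArrayInputStream(raw.getBytes(), 0, raw.getLength()),
                   handler, new Metadata());
    } catch (Exception e) {
      // Skip documents Tika cannot parse instead of failing the whole job.
      context.getCounter("tika", "parse-failures").increment(1);
      return;
    }
    // Emit <document path, extracted text>; a later step would turn this
    // text into Mahout Vectors.
    context.write(path, new Text(handler.toString()));
  }
}

A driver that wires this up with SequenceFile input and output formats is omitted here.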
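And a rough sketch of Mark's conversion step for the simple local case, including the name-field trick he mentions for making ClusterDumper's output readable. The Mahout vector API has moved around between releases, so treat the class names below (the 0.2-era org.apache.mahout.matrix.SparseVector, assumed here to implement Writable and to have a settable name) as assumptions to check against your own checkout; the whitespace-separated input format and the paths are invented for the example.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.matrix.SparseVector;

public class VectorConverter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0] = local text file, one data point per line of feature values
    // args[1] = output on the DFS, e.g. /dfs/input/data/dir/part-00000
    Path output = new Path(args[1]);

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, SparseVector.class);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    int id = 0;
    while ((line = in.readLine()) != null) {
      String[] tokens = line.split("\\s+");
      SparseVector vec = new SparseVector(tokens.length);
      for (int i = 0; i < tokens.length; i++) {
        vec.set(i, Double.parseDouble(tokens[i]));
      }
      // Setting the name field to an ID is what lets ClusterDumper list
      // Vector IDs per cluster later, as Mark notes above.
      vec.setName("doc-" + id++);
      writer.append(new Text(vec.getName()), vec);
    }
    in.close();
    writer.close();
  }
}

For anything bigger than a toy dataset you would do the same thing inside a map-reduce job, as Mark says, along the lines of the syntheticcontrol InputDriver.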
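Finally, a sketch of the "single entry point" Aleksander is after: one small program that drives the pipeline by calling the drivers' main methods directly, using the same flags as Mark's hadoop jar command above. The class name, paths, and the values for -k and -x are placeholders.

import org.apache.mahout.clustering.kmeans.KMeansDriver;

// Hypothetical single entry point; run it with
//   hadoop jar your-job.jar NightlyClusteringJob
// so the cluster configuration is picked up.
public class NightlyClusteringJob {
  public static void main(String[] args) throws Exception {
    // Step 1: assume the Vectors are already on the DFS
    // (see the conversion sketch above).

    // Step 2: run k-means with the same flags as the command line.
    KMeansDriver.main(new String[] {
        "-i", "/dfs/input/data/dir",
        "-c", "/dfs/initial/rand/centroids/dir",
        "-o", "/dfs/output/dir",
        "-k", "20",   // placeholder for <numClusters>
        "-x", "10"    // placeholder for <maxIters>
    });

    // Step 3: pull the output down and run ClusterDumper on it, as in
    // Mark's last two steps.
  }
}

A job that needs to run on a regular basis could then simply schedule this one class, which is the kind of workflow Aleksander describes.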
