Perfect! We can have two AMIs: Mahout trunk and the Mahout release version.
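For anyone who wants to experiment with this, here is a minimal sketch of the boto-based launch flow zaki describes further down the thread, combined with Grant's checklist. Every concrete value (AMI id, key pair, security group, package names, URLs) is a placeholder, not a tested configuration:

# Sketch only: assumes boto is installed and AWS credentials are available
# (e.g. in ~/.boto or the environment). All ids/names below are placeholders.
import boto

conn = boto.connect_ec2()

# User-data runs at first boot (EC2 caps it at 16K). This mirrors Grant's
# checklist: Java 1.6, Maven, svn, the matching Hadoop, a Mahout checkout.
user_data = """#!/bin/bash
apt-get -y install sun-java6-jdk maven2 subversion
apt-get -y install hadoop-0.20    # whichever Hadoop version Mahout expects
svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk /opt/mahout
"""

reservation = conn.run_instances(
    'ami-00000000',               # placeholder vanilla AMI
    min_count=1, max_count=1,
    key_name='mahout-keypair',    # placeholder key pair
    security_groups=['mahout'],   # group must already open the Hadoop ports
    instance_type='m1.large',
    user_data=user_data)

print(reservation.instances[0].id)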
On Mon, Jan 18, 2010 at 8:24 PM, Grant Ingersoll <[email protected]> wrote:

> OK, thanks for all the advice. I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
>
> I want to be able to run the trunk version of Mahout with little upgrade
> pain, both on an individual node and in a cluster.
>
> Is this the shortest path? I don't have much experience w/ creating AMIs,
> but I want my work to be reusable by the community (remember, committers
> can get credits from Amazon for testing Mahout).
>
> After that, I want to convert some of the public datasets to vector format
> and run some performance benchmarks.
>
> Thoughts?
>
> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>
> > I'm using Cloudera's with a 5-node cluster (+1 master node) that runs
> > Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do
> > is wget Mahout's job files and the data from S3, and launch my job.
> >
> > --- On Tue, 1/12/10, deneche abdelhakim <[email protected]> wrote:
> >
> >> From: deneche abdelhakim <[email protected]>
> >> Subject: Re: Re: Good starting instance for AMI
> >> To: [email protected]
> >> Date: Tuesday, January 12, 2010, 3:44 AM
> >> I used Cloudera's with Mahout to test the Decision Forest
> >> implementation.
> >>
> >> --- On Mon, 1/11/10, Grant Ingersoll <[email protected]> wrote:
> >>
> >>> From: Grant Ingersoll <[email protected]>
> >>> Subject: Re: Re: Good starting instance for AMI
> >>> To: [email protected]
> >>> Date: Monday, January 11, 2010, 8:51 PM
> >>> One quick question for all who responded:
> >>> How many have tried Mahout with the setup they recommended?
> >>>
> >>> -Grant
> >>>
> >>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> >>>
> >>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
> >>>>
> >>>> I have used both to get Hadoop jobs up and running (although my EMR
> >>>> use has mostly been limited to running batch Pig scripts weekly).
> >>>> Deciding which one to use really depends on what kind of job/data
> >>>> you're working with.
> >>>>
> >>>> EMR is most useful if you're already storing the dataset you're using
> >>>> on S3 and plan on running a one-off job. My understanding is that it's
> >>>> configured to use jets3t to stream data from S3 rather than copying it
> >>>> to the cluster, which is fine for a single pass over a small to
> >>>> medium-sized dataset, but obviously slower for multiple passes or
> >>>> larger datasets. The API is also useful if you have a set workflow
> >>>> that you plan to run on a regular basis, and I often prototype quick
> >>>> and dirty jobs on very small EMR clusters to test how some things run
> >>>> in the wild (obviously not the most cost-effective solution, but I've
> >>>> found pseudo-distributed mode doesn't catch everything).
> >>>>
> >>>> CDH gives you greater control over the initial setup and configuration
> >>>> of your cluster. From my understanding, it's not really an AMI.
> >>>> Rather, it's a set of Python scripts modified from the ec2 scripts in
> >>>> hadoop/contrib, with some nifty additions like being able to specify
> >>>> and set up EBS volumes, proxy on the cluster, and some others.
> >>>> The scripts use the boto Python module (a very useful Python module
> >>>> for working with EC2) to ask EC2 for a cluster of the specified size,
> >>>> built from whatever vanilla AMI is specified. It sets up the security
> >>>> groups, opens the relevant ports, and then passes the init script to
> >>>> each of the instances once they've booted (the same user-data file
> >>>> setup, which is limited to 16K I believe). The init script tells each
> >>>> node to download Hadoop (from Cloudera's OS-specific repos) and any
> >>>> other user-specified packages and set them up. The Hadoop config XML
> >>>> is hardcoded into the init script (although you can pass a modified
> >>>> config beforehand). The master is started first, and then the slaves,
> >>>> so that the slaves can be given info about which NN and JT to connect
> >>>> to (the config uses the public DNS, I believe, to make things easier
> >>>> to set up). You can use either 0.18.3 (CDH) or 0.20 (CDH2) when it
> >>>> comes to Hadoop versions, although I've had mixed results with the
> >>>> latter.
> >>>>
> >>>> Personally, I'd still like some kind of facade or something similar to
> >>>> further abstract things and make it easier for others to quickly set
> >>>> up ad-hoc clusters for quick-and-dirty jobs. I know other libraries
> >>>> like Crane have been released recently, but given the language of
> >>>> choice (Clojure), I haven't yet had a chance to really investigate.
> >>>>
> >>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
> >>>>
> >>>>> Just use several of these files.
> >>>>>
> >>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]> wrote:
> >>>>>
> >>>>>> EMR requires an S3 bucket, but S3 objects have a file-size limit
> >>>>>> (5 GB), so that needs some extra care. Has anyone else run into the
> >>>>>> file-size problem on S3? I kind of think it's unreasonable to have a
> >>>>>> 5 GB limit when we want to use the system to deal with large
> >>>>>> datasets.
> >>>>>
> >>>>> --
> >>>>> Ted Dunning, CTO
> >>>>> DeepDyve
> >>>>
> >>>> --
> >>>> Zaki Rahaman
> >>>
> >>> --------------------------
> >>> Grant Ingersoll
> >>> http://www.lucidimagination.com/
> >>>
> >>> Search the Lucene ecosystem using Solr/Lucene:
> >>> http://www.lucidimagination.com/search
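A footnote on the EMR discussion above: boto also speaks the EMR API, so a one-off job against data already sitting in S3 can be kicked off in a few lines. A rough sketch, assuming a boto version that ships the boto.emr module; the buckets, job jar, driver class, and its arguments are placeholders:

# Sketch only: bucket names, the jar, and the driver arguments are
# illustrative, not a tested Mahout invocation.
from boto.emr.connection import EmrConnection
from boto.emr.step import JarStep

conn = EmrConnection('<access-key>', '<secret-key>')

# One step that runs a Mahout job jar already uploaded to S3.
step = JarStep(
    name='mahout-kmeans',
    jar='s3n://mybucket/jobs/mahout-examples.job',
    step_args=['org.apache.mahout.clustering.kmeans.KMeansDriver',
               's3n://mybucket/vectors/',    # illustrative arguments only
               's3n://mybucket/output/'])

jobid = conn.run_jobflow(
    name='mahout one-off',
    log_uri='s3n://mybucket/logs/',
    steps=[step],
    num_instances=4,
    master_instance_type='m1.small',
    slave_instance_type='m1.small')

print(conn.describe_jobflow(jobid).state)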

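And on Ted's "just use several of these files" suggestion: a sketch of splitting one large input into sub-5GB S3 objects with boto (bucket and key names are placeholders). A Hadoop job can then be pointed at the key prefix as if it were an input directory:

# Sketch only: assumes boto is installed and credentials are configured.
import os
import tempfile
import boto

PART_SIZE = 4 * 1024 ** 3   # stay safely under S3's 5 GB per-object limit
BUF_SIZE = 64 * 1024 ** 2   # copy in 64 MB blocks to bound memory use

def upload_in_parts(local_path, bucket_name, key_prefix):
    """Split local_path into S3 objects named <key_prefix>/part-00000,
    part-00001, ... so that each stays under the 5 GB limit."""
    conn = boto.connect_s3()
    bucket = conn.get_bucket(bucket_name)
    part = 0
    with open(local_path, 'rb') as src:
        while True:
            remaining = PART_SIZE
            tmp = tempfile.NamedTemporaryFile(delete=False)
            try:
                # Fill one part-sized temp file from the source stream.
                while remaining > 0:
                    block = src.read(min(BUF_SIZE, remaining))
                    if not block:
                        break
                    tmp.write(block)
                    remaining -= len(block)
                tmp.close()
                if os.path.getsize(tmp.name) == 0:
                    break             # source exhausted, nothing to upload
                key = bucket.new_key('%s/part-%05d' % (key_prefix, part))
                key.set_contents_from_filename(tmp.name)
                part += 1
            finally:
                os.unlink(tmp.name)

upload_in_parts('bigdataset.vec', 'mybucket', 'input/vectors')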