Sounds great. It might also be handy to bundle into the AMI a local Maven repository pre-populated with Mahout's build dependencies, to shorten the build time.
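Something along these lines at image-bake time would do it (a rough sketch; the checkout path is an assumption, and dependency:go-offline doesn't always catch every plugin artifact, hence the follow-up install pass):

    # prefetch_m2.py -- run while baking the AMI so ~/.m2 ships pre-populated.
    import subprocess

    MAHOUT_SRC = "/home/hadoop/mahout"  # hypothetical checkout location

    # Resolve plugins and dependencies into the local repo (~/.m2/repository
    # by default) without running a full build.
    subprocess.check_call(["mvn", "dependency:go-offline"], cwd=MAHOUT_SRC)

    # A skip-tests install pass additionally caches anything the real build
    # touches that go-offline misses.
    subprocess.check_call(["mvn", "-DskipTests", "install"], cwd=MAHOUT_SRC)

After that, a fresh instance only needs to pull the latest trunk changes rather than the whole dependency tree.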
I wonder if the CDH2 AMIs could be used as a starting point? I'm not sure you're allowed to unbundle and modify public AMIs; it would certainly be more difficult to start from scratch. Amazon hosts some public datasets for free: http://aws.amazon.com/publicdatasets/ Perhaps the Mahout test data in vector form could be bundled up into a snapshot that anyone could re-use.

On Mon, Jan 18, 2010 at 9:54 AM, Grant Ingersoll <[email protected]> wrote:
> OK, thanks for all the advice. I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout
>
> I want to be able to run the trunk version of Mahout with little upgrade
> pain, both on an individual node and in a cluster.
>
> Is this the shortest path? I don't have much experience w/ creating AMIs,
> but I want my work to be reusable by the community (remember, committers
> can get credits from Amazon for testing Mahout).
>
> After that, I want to convert some of the public datasets to vector format
> and run some performance benchmarks.
>
> Thoughts?
>
> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>
>> I'm using Cloudera's with a five-node cluster (+ 1 master node) that runs
>> Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do is
>> wget Mahout's job files and the data from S3, and launch my job.
>>
>> On Tue 12.1.10, deneche abdelhakim <[email protected]> wrote:
>>
>>> From: deneche abdelhakim <[email protected]>
>>> Subject: Re: Re: Good starting instance for AMI
>>> To: [email protected]
>>> Date: Tuesday, January 12, 2010, 3:44 AM
>>> I used Cloudera's with Mahout to test the Decision Forest
>>> implementation.
>>>
>>> On Mon 11.1.10, Grant Ingersoll <[email protected]> wrote:
>>>
>>>> From: Grant Ingersoll <[email protected]>
>>>> Subject: Re: Re: Good starting instance for AMI
>>>> To: [email protected]
>>>> Date: Monday, January 11, 2010, 8:51 PM
>>>> One quick question for all who responded:
>>>> how many have tried Mahout with the setup they recommended?
>>>>
>>>> -Grant
>>>>
>>>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>>>>
>>>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
>>>>>
>>>>> I have used both to get Hadoop jobs up and running (although my EMR
>>>>> use has mostly been limited to running batch Pig scripts weekly).
>>>>> Deciding on which one to use really depends on what kind of job/data
>>>>> you're working with.
>>>>>
>>>>> EMR is most useful if you're already storing your dataset on S3 and
>>>>> plan on running a one-off job. My understanding is that it's
>>>>> configured to use jets3t to stream data from S3 rather than copying it
>>>>> to the cluster, which is fine for a single pass over a small to
>>>>> medium-sized dataset, but obviously slower for multiple passes or
>>>>> larger datasets. The API is also useful if you have a set workflow
>>>>> that you plan to run on a regular basis, and I often prototype quick
>>>>> and dirty jobs on very small EMR clusters to test how some things run
>>>>> in the wild (obviously not the most cost-effective solution, but I've
>>>>> found pseudo-distributed mode doesn't catch everything).
>>>>>
>>>>> CDH gives you greater control over the initial setup and configuration
>>>>> of your cluster. From my understanding, it's not really an AMI.
>>>>> Rather, it's a set of Python scripts modified from the ec2 scripts in
>>>>> hadoop/contrib, with some nifty additions like being able to specify
>>>>> and set up EBS volumes, proxy on the cluster, and some others. The
>>>>> scripts use the boto Python module (a very useful module for working
>>>>> with EC2) to ask EC2 to set up a cluster of the specified size from
>>>>> whatever vanilla AMI is specified. They set up the security groups,
>>>>> open the relevant ports, and then pass the init script to each of the
>>>>> instances once they've booted (the same user-data file setup, which is
>>>>> limited to 16K I believe). The init script tells each node to download
>>>>> Hadoop (from Cloudera's OS-specific repos) and any other
>>>>> user-specified packages and set them up. The Hadoop config XML is
>>>>> hardcoded into the init script (although you can pass a modified
>>>>> config beforehand). The master is started first, and then the slaves,
>>>>> so that the slaves can be told which NN and JT (NameNode and
>>>>> JobTracker) to connect to (the config uses the public DNS, I believe,
>>>>> to make things easier to set up). You can use either 0.18.3 (CDH) or
>>>>> 0.20 (CDH2) when it comes to Hadoop versions, although I've had mixed
>>>>> results with the latter.
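(An aside on the boto part above: stripped of the bookkeeping, the request those scripts send EC2 comes down to a single run_instances call. A rough, untested sketch follows; the AMI id, key pair, security group, and init-script file name are all placeholders:)

    # Rough equivalent of the EC2 request the CDH scripts make via boto.
    from boto.ec2.connection import EC2Connection

    conn = EC2Connection()  # picks up AWS credentials from the environment

    # The init script rides along as user-data (limited to ~16K, as noted).
    init_script = open("hadoop-ec2-init-remote.sh").read()

    reservation = conn.run_instances(
        "ami-12345678",                      # placeholder vanilla AMI
        min_count=5, max_count=5,            # size of the slave group
        key_name="my-keypair",               # placeholder key pair
        security_groups=["hadoop-cluster"],  # group with the ports opened
        instance_type="m1.small",
        user_data=init_script,
    )
    for instance in reservation.instances:
        print(instance.id)

(The master-first startup and config templating zaki describes are layered on top of a call like this.)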
>>>>> Personally, I'd still like some kind of facade or something similar
>>>>> to further abstract things and make it easier for others to quickly
>>>>> set up ad-hoc clusters for quick-and-dirty jobs. I know other
>>>>> libraries like Crane have been released recently, but given the
>>>>> language of choice (Clojure), I haven't yet had a chance to really
>>>>> investigate.
>>>>>
>>>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> Just use several of these files.
>>>>>>
>>>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> EMR requires an S3 bucket, but S3 objects have a file size limit
>>>>>>> (5 GB), so some extra care is needed here. Has anyone else run into
>>>>>>> the file size problem on S3? I kind of think it's unreasonable to
>>>>>>> have a 5 GB limit when we want to use the system to deal with large
>>>>>>> data sets.
>>>>>>
>>>>>> --
>>>>>> Ted Dunning, CTO
>>>>>> DeepDyve
>>>>>
>>>>> --
>>>>> Zaki Rahaman
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com/
>>>>
>>>> Search the Lucene ecosystem using Solr/Lucene:
>>>> http://www.lucidimagination.com/search
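P.S. On the 5 GB limit Liang ran into: Ted's "several files" suggestion, sketched with boto (untested; the bucket name, input file, and 1 GB chunk size are placeholders I picked, not anything prescribed):

    # split_upload.py -- work around the 5 GB S3 object limit by uploading
    # one large file as several sequentially numbered keys.
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    CHUNK = 1024 * 1024 * 1024  # 1 GB per part, comfortably under the limit

    conn = S3Connection()  # picks up AWS credentials from the environment
    bucket = conn.get_bucket("my-mahout-data")  # placeholder bucket name

    with open("vectors.seq", "rb") as f:  # hypothetical input file
        part = 0
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            key = Key(bucket, "vectors.seq.part-%05d" % part)
            key.set_contents_from_string(data)
            part += 1

A Hadoop job can then consume the whole set at once by pointing its input path at the common key prefix.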
