OK, thanks for all the advice. I'm wondering if this makes sense: create an AMI with:

1. Java 1.6
2. Maven
3. svn
4. Mahout's exact Hadoop version
5. A checkout of Mahout

I want to be able to run the trunk version of Mahout with little upgrade pain, both on an individual node and in a cluster. Is this the shortest path? I don't have much experience with creating AMIs, but I want my work to be reusable by the community (remember, committers can get credits from Amazon for testing Mahout). After that, I want to convert some of the public datasets to vector format and run some performance benchmarks. Thoughts? A rough sketch of what launching such an image might look like is below.
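For what it's worth, here is a minimal, untested sketch of launching a single node from such an image with boto. The AMI id, key pair, and security group names are placeholders, and it assumes the items listed above are already baked into the image:

    # Launch one instance of a (hypothetical) Mahout AMI with boto and hand it a
    # small user-data bootstrap script. The AMI id, key pair, and security group
    # below are placeholders, not real resources.
    from boto.ec2.connection import EC2Connection

    conn = EC2Connection()  # reads AWS credentials from the environment

    # user-data is passed to the instance at boot (limited to 16K, as noted below);
    # here it just refreshes the Mahout checkout assumed to be baked into the AMI.
    bootstrap = """#!/bin/bash
    cd /opt/mahout && svn update && mvn -DskipTests install
    """

    reservation = conn.run_instances(
        image_id='ami-00000000',        # placeholder id for the Mahout AMI
        key_name='mahout-test',         # placeholder key pair
        instance_type='m1.large',
        security_groups=['mahout'],     # placeholder security group
        user_data=bootstrap,
    )
    print(reservation.instances[0].id)

The same run_instances call, repeated per node (or with min_count/max_count raised), would cover the cluster case; the heavy lifting is whatever is baked into the AMI and the bootstrap script.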
On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:

> I'm using Cloudera's with a 5-node cluster (+ 1 master node) that runs Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do is wget the Mahout job files and the data from S3, and launch my job.
>
> --- On Tue 12.1.10, deneche abdelhakim <[email protected]> wrote:
>
>> From: deneche abdelhakim <[email protected]>
>> Subject: Re: Re: Good starting instance for AMI
>> To: [email protected]
>> Date: Tuesday, January 12, 2010, 3:44 AM
>> I used Cloudera's with Mahout to test the Decision Forest implementation.
>>
>> --- On Mon 11.1.10, Grant Ingersoll <[email protected]> wrote:
>>
>>> From: Grant Ingersoll <[email protected]>
>>> Subject: Re: Re: Good starting instance for AMI
>>> To: [email protected]
>>> Date: Monday, January 11, 2010, 8:51 PM
>>> One quick question for all who responded: how many have tried Mahout with the setup they recommended?
>>>
>>> -Grant
>>>
>>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>>>
>>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
>>>>
>>>> I have used both to get Hadoop jobs up and running (although my EMR use has mostly been limited to running batch Pig scripts weekly). Deciding which one to use really depends on what kind of job/data you're working with.
>>>>
>>>> EMR is most useful if you're already storing the dataset you're using on S3 and plan on running a one-off job. My understanding is that it's configured to use jets3t to stream data from S3 rather than copying it to the cluster, which is fine for a single pass over a small to medium sized dataset, but obviously slower for multiple passes or larger datasets. The API is also useful if you have a set workflow that you plan to run on a regular basis, and I often prototype quick and dirty jobs on very small EMR clusters to test how some things run in the wild (obviously not the most cost-effective solution, but I've found pseudo-distributed mode doesn't catch everything).
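A minimal, untested sketch of this kind of one-off EMR run using boto's EMR bindings (assuming a boto version that ships boto.emr; the bucket names, jar path, and driver class below are made-up placeholders):

    # Submit a one-off job flow to Elastic MapReduce with boto; input and output
    # live on S3, so nothing has to be copied onto the cluster by hand.
    from boto.emr.connection import EmrConnection
    from boto.emr.step import JarStep

    conn = EmrConnection()  # credentials from the environment

    step = JarStep(
        name='example-mahout-step',
        jar='s3n://my-bucket/jars/mahout-job.jar',   # placeholder job jar on S3
        main_class='org.example.SomeDriver',         # placeholder driver class
        step_args=['s3n://my-bucket/input', 's3n://my-bucket/output'],
    )

    jobflow_id = conn.run_jobflow(
        name='mahout-one-off',
        log_uri='s3n://my-bucket/logs',
        hadoop_version='0.20',
        num_instances=4,
        master_instance_type='m1.small',
        slave_instance_type='m1.small',
        steps=[step],
    )
    print(jobflow_id)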
>>>> CDH gives you greater control over the initial setup and configuration of your cluster. From my understanding, it's not really an AMI. Rather, it's a set of Python scripts modified from the ec2 scripts in hadoop/contrib, with some nifty additions like being able to specify and set up EBS volumes, proxy on the cluster, and some others. The scripts use the boto Python module (a very useful module for working with EC2) to ask EC2 to set up a cluster of the specified size from whatever vanilla AMI is specified. They set up the security groups, open the relevant ports, and then pass the init script to each of the instances once they've booted (the same user-data file setup, which is limited to 16K I believe). The init script tells each node to download Hadoop (from Cloudera's OS-specific repos) and any other user-specified packages and set them up. The Hadoop config XML is hardcoded into the init script (although you can pass a modified config beforehand). The master is started first, and then the slaves are started so that the slaves can be given info about which NN and JT to connect to (the config uses the public DNS, I believe, to make things easier to set up). You can use either 0.18.3 (CDH) or 0.20 (CDH2) when it comes to Hadoop versions, although I've had mixed results with the latter.
>>>>
>>>> Personally, I'd still like some kind of facade or something similar to further abstract things and make it easier for others to quickly set up ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like Crane have been released recently, but given the language of choice (Clojure), I haven't yet had a chance to really investigate.
>>>>
>>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> Just use several of these files.
>>>>>
>>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]> wrote:
>>>>>
>>>>>> EMR requires an S3 bucket, but an S3 object has a 5 GB size limit, so some extra care is needed here. Has anyone else encountered the file size problem on S3? I kind of think it's unreasonable to have a 5 GB limit when we want to use the system to deal with large data sets.
>>>>>
>>>>> --
>>>>> Ted Dunning, CTO
>>>>> DeepDyve
>>>>
>>>> --
>>>> Zaki Rahaman
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem using Solr/Lucene:
>>> http://www.lucidimagination.com/search
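Regarding the 5 GB per-object limit on S3 and Ted's suggestion to just use several files: a minimal boto sketch of splitting a large local file into parts and uploading each part as its own key might look like the following (the bucket, file, and key names are placeholders, and the 1 GB chunk size is arbitrary):

    # Split a large local file into chunks well under 5 GB and upload each chunk
    # as its own S3 key, working around the per-object size limit discussed above.
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    CHUNK = 1024 * 1024 * 1024  # 1 GB per part

    conn = S3Connection()                  # credentials from the environment
    bucket = conn.get_bucket('my-bucket')  # placeholder bucket name

    def upload_in_parts(path, key_prefix):
        """Write path to S3 as key_prefix.part-00000, part-00001, ..."""
        part = 0
        with open(path, 'rb') as src:
            while True:
                data = src.read(CHUNK)  # reads each chunk into memory; fine for a sketch
                if not data:
                    break
                k = Key(bucket)
                k.key = '%s.part-%05d' % (key_prefix, part)
                k.set_contents_from_string(data)
                part += 1
        return part

    print(upload_in_parts('bigdata.vec', 'datasets/bigdata'))

Since Hadoop and EMR take directories (or globs) of part files as input anyway, keeping each part under the limit costs nothing downstream.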
