It would be great if we could bundle the LZO codec too. We also need a script for adding Hadoop slaves, so that a cluster can be brought up easily (it needn't be an optimized configuration).
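A minimal sketch of what such a slave script might look like, assuming passwordless SSH from the master and a standard Hadoop 0.20 layout (the script name and paths are placeholders, not anything we ship today):

    #!/bin/bash
    # add-slave.sh <slave-host> -- hypothetical helper: register a new
    # slave with this master and start its daemons. Assumes HADOOP_HOME
    # is set and the slave already has the same Hadoop build and config.
    SLAVE_HOST="$1"
    echo "$SLAVE_HOST" >> "$HADOOP_HOME/conf/slaves"
    ssh "$SLAVE_HOST" "$HADOOP_HOME/bin/hadoop-daemon.sh start datanode"
    ssh "$SLAVE_HOST" "$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker"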
One problem I see is that we may have to build images for both i386 and x64 kernels (or we won't be able to run small and large instances, respectively).

Robin

On Mon, Jan 18, 2010 at 8:50 PM, Robin Anil <[email protected]> wrote:

> Perfect! We can have two AMIs: Mahout trunk and the Mahout release version.
>
> On Mon, Jan 18, 2010 at 8:24 PM, Grant Ingersoll <[email protected]> wrote:
>
>> OK, thanks for all the advice. I'm wondering if this makes sense:
>>
>> Create an AMI with:
>> 1. Java 1.6
>> 2. Maven
>> 3. svn
>> 4. Mahout's exact Hadoop version
>> 5. A checkout of Mahout
>>
>> I want to be able to run the trunk version of Mahout with little upgrade pain, both on an individual node and in a cluster.
>>
>> Is this the shortest path? I don't have much experience with creating AMIs, but I want my work to be reusable by the community (remember, committers can get credits from Amazon for testing Mahout).
>>
>> After that, I want to convert some of the public datasets to vector format and run some performance benchmarks.
>>
>> Thoughts?
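A rough sketch of the kind of bootstrap script such an AMI could bake in, covering the five items above. This assumes a Fedora-style image with yum; the package names, versions, and mirror paths are assumptions, not a tested recipe:

    #!/bin/bash
    # Hypothetical AMI bootstrap covering items 1-5 above.
    # 1-3: Java 1.6, svn, then Maven from a binary release
    yum install -y java-1.6.0-openjdk-devel subversion
    wget http://archive.apache.org/dist/maven/binaries/apache-maven-2.2.1-bin.tar.gz
    tar xzf apache-maven-2.2.1-bin.tar.gz -C /usr/local
    export PATH=/usr/local/apache-maven-2.2.1/bin:$PATH
    # 4: pin the exact Hadoop version Mahout builds against (0.20.2 as an example)
    wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
    tar xzf hadoop-0.20.2.tar.gz -C /usr/local
    # 5: check out and build Mahout trunk
    svn co http://svn.apache.org/repos/asf/mahout/trunk /usr/local/mahout-trunk
    cd /usr/local/mahout-trunk && mvn -DskipTests install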
>> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>>
>>> I'm using Cloudera's with a 5-node cluster (+ 1 master node) that runs Hadoop 0.20+. Hadoop comes pre-installed and configured; all I have to do is wget Mahout's job files and the data from S3, and launch my job.
>>>
>>> --- On Tue 12.1.10, deneche abdelhakim <[email protected]> wrote:
>>>
>>>> From: deneche abdelhakim <[email protected]>
>>>> Subject: Re: Re : Good starting instance for AMI
>>>> To: [email protected]
>>>> Date: Tuesday, January 12, 2010, 3:44 AM
>>>>
>>>> I used Cloudera's with Mahout to test the Decision Forest implementation.
>>>>
>>>> --- On Mon 11.1.10, Grant Ingersoll <[email protected]> wrote:
>>>>
>>>>> From: Grant Ingersoll <[email protected]>
>>>>> Subject: Re: Re : Good starting instance for AMI
>>>>> To: [email protected]
>>>>> Date: Monday, January 11, 2010, 8:51 PM
>>>>>
>>>>> One quick question for all who responded: how many have tried Mahout with the setup they recommended?
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>>>>>
>>>>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
>>>>>>
>>>>>> I have used both to get Hadoop jobs up and running (although my EMR use has mostly been limited to running batch Pig scripts weekly). Deciding which one to use really depends on what kind of job/data you're working with.
>>>>>>
>>>>>> EMR is most useful if you're already storing the dataset you're using on S3 and plan on running a one-off job. My understanding is that it's configured to use jets3t to stream data from S3 rather than copying it to the cluster, which is fine for a single pass over a small to medium sized dataset, but obviously slower for multiple passes or larger datasets. The API is also useful if you have a set workflow that you plan to run on a regular basis, and I often prototype quick and dirty jobs on very small EMR clusters to test how some things run in the wild (obviously not the most cost-effective solution, but I've found pseudo-distributed mode doesn't catch everything).
>>>>>>
>>>>>> CDH gives you greater control over the initial setup and configuration of your cluster. From my understanding, it's not really an AMI. Rather, it's a set of Python scripts, modified from the ec2 scripts in hadoop/contrib, with some nifty additions like being able to specify and set up EBS volumes, proxy on the cluster, and some others. The scripts use the boto Python module (a very useful Python module for working with EC2) to make a request to EC2 to set up a cluster of the specified size from whatever vanilla AMI is specified. They set up the security groups, open the relevant ports, and then pass the init script to each of the instances once they've booted (the same user-data file setup, which is limited to 16K I believe). The init script tells each node to download Hadoop (from Cloudera's OS-specific repos) and any other user-specified packages and set them up. The Hadoop config XML is hardcoded into the init script (although you can pass a modified config beforehand). The master is started first, and then the slaves, so that the slaves can be given info about which NN and JT to connect to (the config uses the public DNS, I believe, to make things easier to set up). You can use either 0.18.3 (CDH) or 0.20 (CDH2) when it comes to Hadoop versions, although I've had mixed results with the latter. (See the launch sketch below.)
>>>>>>
>>>>>> Personally, I'd still like some kind of facade or something similar to further abstract things and make it easier for others to quickly set up ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like Crane have been released recently, but given the language of choice (Clojure), I haven't yet had a chance to really investigate.
>>>>>>
>>>>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
>>>>>>
>>>>>>> Just use several of these files.
>>>>>>>
>>>>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]> wrote:
>>>>>>>
>>>>>>>> EMR requires an S3 bucket, but S3 has a file size limit (5 GB), so this needs some extra care. Has anyone else encountered the file size problem on S3? I kind of think it's unreasonable to have a 5 GB size limit when we want to use the system to deal with large data sets.
>>>>>>>
>>>>>>> --
>>>>>>> Ted Dunning, CTO
>>>>>>> DeepDyve
>>>>>>
>>>>>> --
>>>>>> Zaki Rahaman
>>>>>
>>>>> --------------------------
>>>>> Grant Ingersoll
>>>>> http://www.lucidimagination.com/
>>>>>
>>>>> Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
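For anyone who wants to try the CDH cloud scripts zaki describes above, launching a cluster looks roughly like this. This is a sketch from memory, so treat the exact subcommand names and cluster sizing as assumptions and check Cloudera's docs:

    # Launch a 5-slave cluster from a vanilla AMI using the CDH cloud
    # scripts; boto reads AWS credentials from the environment.
    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...
    hadoop-ec2 launch-cluster my-mahout-cluster 5
    # Proxy in to reach the NN/JT web UIs, then log in to run a job
    hadoop-ec2 proxy my-mahout-cluster
    hadoop-ec2 login my-mahout-cluster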
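On the 5 GB limit: Ted's "several files" suggestion can be as simple as splitting before upload. A sketch, assuming GNU split and s3cmd are available (the bucket and file names are made up):

    # Split a large input into chunks under the S3 object size limit.
    # A Hadoop job can take the whole directory of part files as input.
    split -b 4096m bigdata.vec bigdata.vec.part-
    for f in bigdata.vec.part-*; do
      s3cmd put "$f" s3://my-bucket/input/"$f"
    done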
