It would be great if we could bundle the LZO codec too.
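If we go the hadoop-lzo route, it mostly comes down to dropping the hadoop-lzo
jar and native libraries onto the image and registering the codec in
core-site.xml. A rough sketch of the properties (these are the names hadoop-lzo
documents; treat the exact codec list as an assumption to verify against the
version we bundle):

    <!-- goes inside the <configuration> element of conf/core-site.xml -->
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>
    <property>
      <name>io.compression.codec.lzo.class</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>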

We also need to include some scripts that add Hadoop slaves, so a cluster can
be brought up easily (it needn't be an optimized configuration); something
along the lines of the sketch below would do.
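A minimal sketch, assuming passwordless ssh from the master to the new node and
the same HADOOP_HOME on every machine (the script name is just a placeholder):

    #!/bin/bash
    # add-slave.sh <hostname> -- run on the master to attach one more worker
    SLAVE=$1
    echo "$SLAVE" >> "$HADOOP_HOME/conf/slaves"                      # so start/stop-all.sh know about it
    scp "$HADOOP_HOME"/conf/*-site.xml "$SLAVE:$HADOOP_HOME/conf/"   # push the cluster config
    ssh "$SLAVE" "$HADOOP_HOME/bin/hadoop-daemon.sh start datanode"
    ssh "$SLAVE" "$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker"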

One problem I see is that we may have to build images for both i386 and x86_64
kernels (or we won't be able to run small and large instances, respectively).
Robin

On Mon, Jan 18, 2010 at 8:50 PM, Robin Anil <[email protected]> wrote:

> Perfect! We can have two AMIs: one for Mahout trunk and one for the Mahout release version.
>
>
> On Mon, Jan 18, 2010 at 8:24 PM, Grant Ingersoll <[email protected]> wrote:
>
>> OK, thanks for all the advice. I'm wondering if this makes sense:
>>
>> Create an AMI with:
>> 1. Java 1.6
>> 2. Maven
>> 3. svn
>> 4. Mahout's exact Hadoop version
>> 5. A checkout of Mahout
>>
>> I want to be able to run the trunk version of Mahout with little upgrade
>> pain, both on an individual node and in a cluster.
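>> Roughly, building that image might boil down to a provisioning script along
>> these lines (just a sketch, assuming a yum-based image; package names, the
>> Maven/Hadoop versions, and the svn URL should be double-checked against what
>> Mahout's pom actually expects):
>>
>>     # 1 & 3: JDK and svn from the distro repos
>>     yum -y install java-1.6.0-openjdk-devel subversion
>>     # 2: Maven
>>     wget http://archive.apache.org/dist/maven/binaries/apache-maven-2.2.1-bin.tar.gz
>>     tar xzf apache-maven-2.2.1-bin.tar.gz -C /usr/local
>>     # 4: the Hadoop release matching Mahout's pom
>>     wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
>>     tar xzf hadoop-0.20.1.tar.gz -C /usr/local
>>     # 5: a checkout of Mahout trunk
>>     svn co http://svn.apache.org/repos/asf/lucene/mahout/trunk /usr/local/mahout-trunk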
>>
>> Is this the shortest path? I don't have much experience w/ creating AMIs,
>> but I want my work to be reusable by the community (remember, committers
>> can get credits from Amazon for testing Mahout).
>>
>> After that, I want to convert some of the public datasets to vector format
>> and run some performance benchmarks.
>>
>> Thoughts?
>>
>> On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:
>>
>> > I'm using Cloudera's with a 5-node cluster (+ 1 master node) that runs
>> > Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do is
>> > wget Mahout's job files and the data from S3, and launch my job.
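>> > (Roughly, on the master that boils down to something like the following;
>> > the bucket name, file names, and job class are placeholders:)
>> >
>> >     wget http://s3.amazonaws.com/my-bucket/mahout-examples-0.3-SNAPSHOT.job
>> >     wget http://s3.amazonaws.com/my-bucket/input-data.csv
>> >     hadoop fs -put input-data.csv input/        # stage the data in HDFS
>> >     hadoop jar mahout-examples-0.3-SNAPSHOT.job <job-main-class> input output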
>> >
>> > --- On Tue, 12.1.10, deneche abdelhakim <[email protected]> wrote:
>> >
>> >> From: deneche abdelhakim <[email protected]>
>> >> Subject: Re: Re : Good starting instance for AMI
>> >> To: [email protected]
>> >> Date: Tuesday, January 12, 2010, 3:44 AM
>> >> I used Cloudera's with Mahout to test
>> >> the Decision Forest implementation.
>> >>
>> >> --- On Mon, 11.1.10, Grant Ingersoll <[email protected]> wrote:
>> >>
>> >>> From: Grant Ingersoll <[email protected]>
>> >>> Subject: Re: Re : Good starting instance for AMI
>> >>> To: [email protected]
>> >>> Date: Monday, January 11, 2010, 8:51 PM
>> >>> One quick question for all who
>> >>> responded:
>> >>> How many have tried Mahout with the setup they
>> >>> recommended?
>> >>>
>> >>> -Grant
>> >>>
>> >>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>> >>>
>> >>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
>> >>>>
>> >>>> I have used both to get hadoop jobs up and running (although my EMR use
>> >>>> has mostly been limited to running batch Pig scripts weekly). Deciding on
>> >>>> which one to use really depends on what kind of job/data you're working
>> >>>> with.
>> >>>>
>> >>>> EMR is most useful if you're already storing the dataset you're using on
>> >>>> S3 and plan on running a one-off job. My understanding is that it's
>> >>>> configured to use jets3t to stream data from S3 rather than copying it to
>> >>>> the cluster, which is fine for a single pass over a small to medium sized
>> >>>> dataset, but obviously slower for multiple passes or larger datasets. The
>> >>>> API is also useful if you have a set workflow that you plan to run on a
>> >>>> regular basis, and I often prototype quick and dirty jobs on very small
>> >>>> EMR clusters to test how some things run in the wild (obviously not the
>> >>>> most cost-effective solution, but I've found pseudo-distributed mode
>> >>>> doesn't catch everything).
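>> >>>> (For the record, kicking off one of those one-off jobs with the EMR
>> >>>> command-line client looks roughly like this; bucket, jar, and path names
>> >>>> are placeholders, it assumes the jar's manifest names the main class, and
>> >>>> the exact flags should be checked against the CLI version you have:)
>> >>>>
>> >>>>     elastic-mapreduce --create --name "mahout-test" \
>> >>>>       --num-instances 3 --instance-type m1.small \
>> >>>>       --jar s3://my-bucket/mahout-examples.job \
>> >>>>       --arg s3://my-bucket/input --arg s3://my-bucket/output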
>> >>>>
>> >>>> CDH gives you greater control over the initial setup and configuration of
>> >>>> your cluster. From my understanding, it's not really an AMI. Rather, it's
>> >>>> a set of Python scripts modified from the ec2 scripts in hadoop/contrib,
>> >>>> with some nifty additions like being able to specify and set up EBS
>> >>>> volumes, proxy into the cluster, and some others. The scripts use the boto
>> >>>> Python module (a very useful module for working with EC2) to ask EC2 to
>> >>>> set up a cluster of the specified size from whatever vanilla AMI you
>> >>>> specify. They set up the security groups, open the relevant ports, and
>> >>>> then pass the init script to each instance once it has booted (the same
>> >>>> user-data file setup, which is limited to 16K I believe). The init script
>> >>>> tells each node to download Hadoop (from Cloudera's OS-specific repos) and
>> >>>> any other user-specified packages and set them up. The Hadoop config XML
>> >>>> is hardcoded into the init script (although you can pass a modified config
>> >>>> beforehand). The master is started first, and then the slaves, so that the
>> >>>> slaves can be given info about which NN and JT to connect to (the config
>> >>>> uses the public DNS, I believe, to make things easier to set up). You can
>> >>>> use either 0.18.3 (CDH) or 0.20 (CDH2) when it comes to Hadoop versions,
>> >>>> although I've had mixed results with the latter.
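>> >>>> (Day to day, using those scripts comes down to a couple of commands,
>> >>>> roughly as below; command names may vary a bit between CDH releases, so
>> >>>> treat this as a sketch:)
>> >>>>
>> >>>>     hadoop-ec2 launch-cluster my-test-cluster 5    # master + 5 slaves
>> >>>>     hadoop-ec2 proxy my-test-cluster               # SOCKS proxy to the NN/JT web UIs
>> >>>>     hadoop-ec2 terminate-cluster my-test-cluster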
>> >>>>
>> >>>> Personally, I'd still like some kind of facade or something similar to
>> >>>> further abstract things and make it easier for others to quickly set up
>> >>>> ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like
>> >>>> Crane have been released recently, but given the language of choice
>> >>>> (Clojure), I haven't yet had a chance to really investigate.
>> >>>>
>> >>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
>> >>>>
>> >>>>> Just use several of these files.
>> >>>>>
>> >>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> EMR requires an S3 bucket, but an S3 object has a file-size limit (5GB),
>> >>>>>> so some extra care is needed here. Has anyone else encountered the file
>> >>>>>> size problem on S3? I kind of think it's unreasonable to have a 5GB size
>> >>>>>> limit when we want to use the system to deal with large data sets.
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Ted Dunning, CTO
>> >>>>> DeepDyve
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Zaki Rahaman
>> >>>
>> >>> --------------------------
>> >>> Grant Ingersoll
>> >>> http://www.lucidimagination.com/
>> >>>
>> >>> Search the Lucene ecosystem using Solr/Lucene:
>> >>> http://www.lucidimagination.com/search
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>>
>>
>
