OK, thanks for all the advice. I'm wondering if this makes sense:

Create an AMI with:
1. Java 1.6
2. Maven
3. svn
4. Mahout's exact Hadoop version
5. A checkout of Mahout

I want to be able to run the trunk version of Mahout with little upgrade pain, 
both on an individual node and in a cluster.
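
To make the EC2 side concrete, here's a rough, untested boto sketch (the same 
Python EC2 library the CDH scripts mentioned below are built on) of what I have 
in mind: launch a base instance, install the stack above by hand, then snapshot 
it. The base AMI id, key pair, and security group names are placeholders, and 
it assumes an EBS-backed base image so create_image works:

# Rough sketch: launch a base instance, provision it by hand (Java 1.6,
# Maven, svn, Mahout's Hadoop, a trunk checkout), then snapshot it as a
# reusable AMI. All ids/names below are placeholders.
import time
from boto.ec2.connection import EC2Connection

conn = EC2Connection()  # AWS credentials come from the environment/boto config

reservation = conn.run_instances(
    'ami-00000000',               # placeholder: vanilla EBS-backed Linux AMI
    key_name='mahout-testing',    # placeholder key pair
    instance_type='m1.large',
    security_groups=['default'])
instance = reservation.instances[0]

while instance.update() != 'running':   # wait for the instance to come up
    time.sleep(10)
print('ssh into %s and install the stack above' % instance.public_dns_name)

# Once provisioned, register the instance as an AMI the community can reuse:
# image_id = conn.create_image(instance.id, 'mahout-trunk')

The manual ssh/provisioning step could later be scripted through user-data, the 
way the CDH scripts do it.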

Is this the shortest path? I don't have much experience w/ creating AMIs, but 
I want my work to be reusable by the community (remember, committers can get 
credits from Amazon for testing Mahout).

After that, I want to convert some of the public datasets to vector format and 
run some performance benchmarks.
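
For the benchmarks I'm picturing nothing fancier than a little driver that 
shells out to hadoop and records wall-clock time per run, something like this 
sketch (the jar and driver class in the commented example are placeholders, 
not real Mahout entry points):

# Minimal benchmark harness sketch: run each job in a subprocess, time it,
# and append the result to a CSV so different instance types and cluster
# sizes can be compared. The commented command is a placeholder.
import csv
import subprocess
import time

def run_and_time(label, cmd):
    start = time.time()
    subprocess.check_call(cmd)
    elapsed = time.time() - start
    with open('benchmarks.csv', 'a') as out:
        csv.writer(out).writerow([label, ' '.join(cmd), '%.1f' % elapsed])
    return elapsed

# Placeholder invocation of a job from the Mahout job jar:
# run_and_time('kmeans-run-1',
#              ['hadoop', 'jar', 'mahout-examples-job.jar',
#               'org.example.SomeDriver', '--input', 'in', '--output', 'out'])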

Thoughts?
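
PS: on the 5GB S3 object limit that comes up further down the thread, Ted's 
"just use several of these files" boils down to splitting before upload, 
roughly like this boto sketch (bucket name, key prefix, chunk size, and file 
names are placeholders):

# Sketch: split a large local file into sub-5GB parts and upload each part
# as its own S3 key. Hadoop can then take the whole prefix as job input.
import tempfile
from boto.s3.connection import S3Connection

CHUNK = 4 * 1024 ** 3  # 4GB per part, comfortably under the 5GB limit

def split_and_upload(path, bucket_name, prefix):
    conn = S3Connection()   # credentials from the environment/boto config
    bucket = conn.get_bucket(bucket_name)
    part = 0
    with open(path, 'rb') as src:
        while True:
            # Spool the next chunk to a temp file so we never hold 4GB in RAM.
            tmp = tempfile.TemporaryFile()
            remaining = CHUNK
            while remaining > 0:
                buf = src.read(min(remaining, 64 * 1024 * 1024))
                if not buf:
                    break
                tmp.write(buf)
                remaining -= len(buf)
            if tmp.tell() == 0:   # nothing left to upload
                tmp.close()
                break
            tmp.seek(0)
            key = bucket.new_key('%s/part-%05d' % (prefix, part))
            key.set_contents_from_file(tmp)
            tmp.close()
            part += 1
    return part

# split_and_upload('wikipedia-dump.xml', 'my-mahout-data', 'datasets/wikipedia')

Since Hadoop jobs take a whole directory/prefix of part files as input anyway, 
nothing downstream should need to change.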

On Jan 11, 2010, at 10:43 PM, deneche abdelhakim wrote:

> I'm using Cloudera's with a 5-node cluster (+ 1 master node) that runs 
> Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do is 
> wget Mahout's job files and the data from S3, and launch my job.
> 
> --- On Tue, 12.1.10, deneche abdelhakim <[email protected]> wrote:
> 
>> From: deneche abdelhakim <[email protected]>
>> Subject: Re: Re: Good starting instance for AMI
>> To: [email protected]
>> Date: Tuesday, January 12, 2010, 3:44 AM
>> I used Cloudera's with Mahout to test
>> the Decision Forest implementation.
>> 
>> --- On Mon, 11.1.10, Grant Ingersoll <[email protected]> wrote:
>> 
>>> From: Grant Ingersoll <[email protected]>
>>> Subject: Re: Re: Good starting instance for AMI
>>> To: [email protected]
>>> Date: Monday, January 11, 2010, 8:51 PM
>>> One quick question for all who responded: How many have tried Mahout 
>>> with the setup they recommended?
>>> 
>>> -Grant
>>> 
>>> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>>> 
>>>> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
>>>> 
>>>> I have used both to get Hadoop jobs up and running (although my EMR use 
>>>> has mostly been limited to running batch Pig scripts weekly). Deciding on 
>>>> which one to use really depends on what kind of job/data you're working 
>>>> with.
>>>> 
>>>> EMR is most useful if you're already storing the dataset you're using on 
>>>> S3 and plan on running a one-off job. My understanding is that it's 
>>>> configured to use jets3t to stream data from S3 rather than copying it to 
>>>> the cluster, which is fine for a single pass over a small to medium sized 
>>>> dataset, but obviously slower for multiple passes or larger datasets. The 
>>>> API is also useful if you have a set workflow that you plan to run on a 
>>>> regular basis, and I often prototype quick and dirty jobs on very small 
>>>> EMR clusters to test how some things run in the wild (obviously not the 
>>>> most cost-effective solution, but I've found pseudo-distributed mode 
>>>> doesn't catch everything).
>>>> 
>>>> CDH gives you greater control over the initial setup and configuration of 
>>>> your cluster. From my understanding, it's not really an AMI. Rather, it's 
>>>> a set of Python scripts modified from the ec2 scripts in hadoop/contrib, 
>>>> with some nifty additions like being able to specify and set up EBS 
>>>> volumes, proxy on the cluster, and some others. The scripts use the boto 
>>>> Python module (a very useful Python module for working with EC2) to ask 
>>>> EC2 to set up a cluster of the specified size with whatever vanilla AMI 
>>>> is specified. They set up the security groups, open the relevant ports, 
>>>> and then pass the init script to each of the instances once they've 
>>>> booted (the same user-data file setup, which is limited to 16K I 
>>>> believe). The init script tells each node to download Hadoop (from 
>>>> Cloudera's OS-specific repos) and any other user-specified packages and 
>>>> set them up. The Hadoop config XML is hardcoded into the init script 
>>>> (although you can pass a modified config beforehand). The master is 
>>>> started first, and then the slaves, so that the slaves can be given info 
>>>> about which NN and JT to connect to (the config uses the public DNS, I 
>>>> believe, to make things easier to set up). You can use either 0.18.3 
>>>> (CDH) or 0.20 (CDH2) when it comes to Hadoop versions, although I've had 
>>>> mixed results with the latter.
>>>> 
>>>> Personally, I'd still like some kind of facade or something similar to 
>>>> further abstract things and make it easier for others to quickly set up 
>>>> ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like 
>>>> Crane have been released recently, but given the language of choice 
>>>> (Clojure), I haven't yet had a chance to really investigate.
>>>> 
>>>> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
>>>> 
>>>>> Just use several of these files.
>>>>> 
>>>>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]> wrote:
>>>>> 
>>>>>> EMR requires an S3 bucket, but S3 has a 5GB limit on object size, so 
>>>>>> some extra care is needed here. Has anyone else encountered the file 
>>>>>> size problem on S3? I kind of think it's unreasonable to have a 5GB 
>>>>>> limit when we want to use the system to deal with large data sets.
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Ted Dunning, CTO
>>>>> DeepDyve
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Zaki Rahaman
>>> 
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>> 
>>> Search the Lucene ecosystem using Solr/Lucene: 
>>> http://www.lucidimagination.com/search
>>> 
