One quick question for all who responded:
How many have tried Mahout with the setup they recommended?

-Grant

On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:

> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
> 
> I have used both to get Hadoop jobs up and running (although my EMR use has
> mostly been limited to running batch Pig scripts weekly). Deciding on which
> one to use really depends on what kind of job/data you're working with.
> 
> EMR is most useful if you're already storing the dataset you're using on S3
> and plan on running a one-off job. My understanding is that it's configured
> to use jets3t to stream data from S3 rather than copying it to the cluster,
> which is fine for a single pass over a small-to-medium-sized dataset, but
> obviously slower for multiple passes or larger datasets. The API is also
> useful if you have a set workflow that you plan to run on a regular basis,
> and I often prototype quick and dirty jobs on very small EMR clusters to
> test how some things run in the wild (obviously not the most cost-effective
> solution, but I've found pseudo-distributed mode doesn't catch everything).
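> 
> For what it's worth, kicking off a small job flow from Python with boto's
> EMR bindings looks roughly like the sketch below. The bucket names, the
> streaming step, and the instance settings are just placeholders, not a
> tested recipe:
> 
>   from boto.emr.connection import EmrConnection
>   from boto.emr.step import StreamingStep
> 
>   # Placeholder credentials and buckets -- substitute your own.
>   conn = EmrConnection('<aws-access-key>', '<aws-secret-key>')
> 
>   # A single Hadoop streaming step that reads from and writes to S3.
>   step = StreamingStep(name='quick and dirty test',
>                        mapper='s3n://my-bucket/scripts/mapper.py',
>                        reducer='aggregate',
>                        input='s3n://my-bucket/input',
>                        output='s3n://my-bucket/output')
> 
>   # Keep the cluster tiny, since this is just to see how things run.
>   jobid = conn.run_jobflow(name='prototype job',
>                            log_uri='s3://my-bucket/logs',
>                            steps=[step],
>                            num_instances=3,
>                            master_instance_type='m1.small',
>                            slave_instance_type='m1.small')
>   print jobid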
> 
> CDH gives you greater control over the initial setup and configuration of
> your cluster. From my understanding, it's not really an AMI. Rather, it's a
> set of Python scripts modified from the EC2 scripts in hadoop/contrib, with
> some nifty additions like being able to specify and set up EBS volumes,
> proxy on the cluster, and a few others. The scripts use the boto Python
> module (a very useful module for working with EC2) to ask EC2 to set up a
> cluster of the specified size from whatever vanilla AMI you point it at.
> They create the security groups, open the relevant ports, and then pass the
> init script to each instance once it has booted (the usual user-data
> mechanism, which is limited to 16K I believe). The init script tells each
> node to download Hadoop (from Cloudera's OS-specific repos) and any other
> user-specified packages and set them up. The Hadoop config XML is hardcoded
> into the init script (although you can pass a modified config beforehand).
> The master is started first, and then the slaves, so the slaves can be told
> which NameNode and JobTracker to connect to (the config uses the public DNS
> names, I believe, to make things easier to set up). As for Hadoop versions,
> you can use either 0.18.3 (CDH) or 0.20 (CDH2), although I've had mixed
> results with the latter.
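> 
> The boto side of it is pretty simple; a stripped-down sketch would look
> something like the following (the AMI id, group name, and init script name
> are placeholders here, and the real scripts do a lot more bookkeeping):
> 
>   import boto
> 
>   conn = boto.connect_ec2()  # AWS credentials picked up from the environment
> 
>   # Security group with the relevant ports opened up.
>   group = conn.create_security_group('hadoop-cluster', 'ad-hoc Hadoop cluster')
>   group.authorize('tcp', 22, 22, '0.0.0.0/0')  # ssh from anywhere
>   group.authorize(src_group=group)             # node-to-node traffic
> 
>   # The init script is passed as user-data (limited to 16K).
>   init_script = open('hadoop-init-remote.sh').read()
> 
>   # Boot the master first, then the slaves, from some vanilla AMI.
>   master = conn.run_instances('ami-xxxxxxxx',
>                               security_groups=['hadoop-cluster'],
>                               instance_type='m1.large',
>                               user_data=init_script)
>   slaves = conn.run_instances('ami-xxxxxxxx',
>                               min_count=4, max_count=4,
>                               security_groups=['hadoop-cluster'],
>                               instance_type='m1.large',
>                               user_data=init_script)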
> 
> Personally, I'd still like some kind of facade or something similar to
> further abstract things and make it easier for others to quickly set up
> ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like Crane
> have been released recently, but given its language of choice (Clojure), I
> haven't yet had a chance to really investigate them.
> 
> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
> 
>> Just use several of these files.
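>> 
>> A rough sketch of that with boto, in case it helps (the bucket and file
>> names are made up, and the chunk size just needs to stay under the cap):
>> 
>>   import os
>>   from boto.s3.connection import S3Connection
>>   from boto.s3.key import Key
>> 
>>   SRC = 'bigfile.dat'          # local file larger than 5GB
>>   CHUNK = 4 * 1024 ** 3        # 4GB pieces keep each object under the limit
>>   BUF = 64 * 1024 * 1024       # copy in 64MB reads to keep memory sane
>> 
>>   conn = S3Connection()        # AWS credentials from the environment
>>   bucket = conn.get_bucket('my-input-bucket')
>> 
>>   part = 0
>>   src = open(SRC, 'rb')
>>   while True:
>>       name = '%s.part-%05d' % (SRC, part)
>>       out = open(name, 'wb')
>>       written = 0
>>       while written < CHUNK:
>>           buf = src.read(min(BUF, CHUNK - written))
>>           if not buf:
>>               break
>>           out.write(buf)
>>           written += len(buf)
>>       out.close()
>>       if written == 0:         # nothing left to copy
>>           os.remove(name)
>>           break
>>       key = Key(bucket, 'input/' + name)
>>       key.set_contents_from_filename(name)  # upload the piece, then clean up
>>       os.remove(name)
>>       part += 1
>>   src.close()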
>> 
>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]
>>> wrote:
>> 
>>> EMR requires an S3 bucket, but S3 objects have a file size limit (5GB),
>>> so some extra care is needed here. Has anyone else encountered the file
>>> size problem on S3? I kind of think it's unreasonable to have a 5GB size
>>> limit when we want to use the system to deal with large data sets.
>>> 
>> 
>> 
>> 
>> --
>> Ted Dunning, CTO
>> DeepDyve
>> 
> 
> 
> 
> -- 
> Zaki Rahaman

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
