One quick question for all who responded: How many have tried Mahout with the setup they recommended?
-Grant

On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:

> Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
>
> I have used both to get Hadoop jobs up and running (although my EMR use
> has mostly been limited to running batch Pig scripts weekly). Deciding on
> which one to use really depends on what kind of job/data you're working
> with.
>
> EMR is most useful if you're already storing the dataset you're using on
> S3 and plan on running a one-off job. My understanding is that it's
> configured to use jets3t to stream data from S3 rather than copying it to
> the cluster, which is fine for a single pass over a small to medium-sized
> dataset, but obviously slower for multiple passes or larger datasets. The
> API is also useful if you have a set workflow that you plan to run on a
> regular basis, and I often prototype quick-and-dirty jobs on very small
> EMR clusters to test how some things run in the wild (obviously not the
> most cost-effective solution, but I've found pseudo-distributed mode
> doesn't catch everything).
>
> CDH gives you greater control over the initial setup and configuration of
> your cluster. From my understanding, it's not really an AMI. Rather, it's
> a set of Python scripts modified from the ec2 scripts in hadoop/contrib,
> with some nifty additions like being able to specify and set up EBS
> volumes, proxy on the cluster, and a few others. The scripts use the boto
> Python module (a very useful module for working with EC2) to ask EC2 to
> start a cluster of the specified size from whatever vanilla AMI is given.
> They set up the security groups, open the relevant ports, and then pass
> the init script to each of the instances once they've booted (the same
> user-data file mechanism, which is limited to 16K, I believe). The init
> script tells each node to download Hadoop (from Cloudera's OS-specific
> repos) and any other user-specified packages and set them up. The Hadoop
> config XML is hardcoded into the init script (although you can pass a
> modified config beforehand). The master is started first, and then the
> slaves, so the slaves can be told which NN (NameNode) and JT (JobTracker)
> to connect to (the config uses the public DNS, I believe, to make things
> easier to set up). You can use either Hadoop 0.18.3 (CDH) or 0.20 (CDH2),
> although I've had mixed results with the latter.
>
> Personally, I'd still like some kind of facade or something similar to
> further abstract things and make it easier for others to quickly set up
> ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like
> Crane have been released recently, but given the language of choice
> (Clojure), I haven't yet had a chance to really investigate.
>
> On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
>
>> Just use several of these files.
>>
>> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]> wrote:
>>
>>> EMR requires an S3 bucket, but S3 objects have a file size limit (5 GB),
>>> so this needs some extra care. Has anyone else encountered the file size
>>> problem on S3? I kind of think it's unreasonable to have a 5 GB limit
>>> when we want to use the system to deal with large data sets.
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>
> --
> Zaki Rahaman

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search
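For anyone curious what the boto-based launch Zaki describes boils down to, here is a rough sketch. The AMI ID, instance type, ports, and init script are placeholder assumptions for illustration, not the actual values from the CDH ec2 scripts:

    # Rough sketch of the kind of cluster launch the CDH ec2 scripts automate.
    # Uses the boto module mentioned above; the AMI ID, ports, instance type,
    # and init script below are placeholders, not the actual Cloudera values.
    import boto

    AMI_ID = 'ami-12345678'                      # placeholder vanilla AMI
    INIT_SCRIPT = open('hadoop-init.sh').read()  # user-data; must stay under ~16K

    conn = boto.connect_ec2()                    # credentials come from the environment

    # Security group with a few of the ports a Hadoop cluster needs reachable.
    group = conn.create_security_group('hadoop-cluster', 'ad-hoc Hadoop cluster')
    group.authorize('tcp', 22, 22, '0.0.0.0/0')         # SSH
    group.authorize('tcp', 50030, 50030, '0.0.0.0/0')   # JobTracker web UI
    group.authorize('tcp', 50070, 50070, '0.0.0.0/0')   # NameNode web UI

    # Master first, so its address can later be handed to the slaves' config.
    master = conn.run_instances(AMI_ID,
                                instance_type='m1.large',
                                security_groups=['hadoop-cluster'],
                                user_data=INIT_SCRIPT).instances[0]

    # Then the slaves; the real scripts rewrite the init script so each slave
    # knows which NameNode/JobTracker (the master's public DNS) to use.
    slaves = conn.run_instances(AMI_ID, min_count=4, max_count=4,
                                instance_type='m1.large',
                                security_groups=['hadoop-cluster'],
                                user_data=INIT_SCRIPT).instances

The real scripts do quite a bit more (EBS volumes, waiting for the master to come up, proxying), but this is the core of the request they send to EC2.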

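On the 5 GB question, Ted's suggestion amounts to splitting the input before uploading it. A rough sketch of that, with an arbitrary part size and naming scheme (split(1) with a byte-size argument does the same job from the shell):

    # Sketch of Ted's "use several of these files": split one large input file
    # into parts under S3's 5 GB per-object limit, then upload each part as its
    # own key. Part size, buffer size, and naming scheme are arbitrary choices.

    PART_SIZE = 4 * 1024 ** 3    # 4 GB per part, comfortably under 5 GB
    BUF_SIZE = 64 * 1024 * 1024  # copy 64 MB at a time; never hold a part in memory

    def split_file(path, part_size=PART_SIZE, buf_size=BUF_SIZE):
        """Write path.part0000, path.part0001, ... and return the part names."""
        parts = []
        with open(path, 'rb') as src:
            index = 0
            while True:
                data = src.read(buf_size)
                if not data:                     # source exhausted
                    break
                part_name = '%s.part%04d' % (path, index)
                with open(part_name, 'wb') as dst:
                    written = 0
                    while data:
                        dst.write(data)
                        written += len(data)
                        if written >= part_size:  # this part is full
                            break
                        data = src.read(min(buf_size, part_size - written))
                parts.append(part_name)
                index += 1
        return parts

Each part can then be pushed to S3 as a separate key (for example with boto's Key.set_contents_from_filename, or a tool like s3cmd), and the whole prefix used as the job's input; Hadoop is happy to take a directory of part files.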