Since I don't have a personal Linux box these days, I code in Eclipse on Windows, then fire up an instance, attach the EBS volume, and patch and test my code. Yes, I have only tried a single node so far.
On Tue, Jan 12, 2010 at 8:55 AM, Liang Chenmin <[email protected]> wrote:
> I first followed the tutorial about running Mahout on EMR; the command
> line instructions needed some revision, though.
>
> On Mon, Jan 11, 2010 at 6:44 PM, deneche abdelhakim <[email protected]> wrote:
> >
> > I used Cloudera's with Mahout to test the Decision Forest implementation.
> >
> > --- On Mon, 11 Jan 2010, Grant Ingersoll <[email protected]> wrote:
> >
> > > From: Grant Ingersoll <[email protected]>
> > > Subject: Re: Re: Good starting instance for AMI
> > > To: [email protected]
> > > Date: Monday, 11 January 2010, 20:51
> > >
> > > One quick question for all who responded:
> > > How many have tried Mahout with the setup they recommended?
> > >
> > > -Grant
> > >
> > > On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> > >
> > > > Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
> > > >
> > > > I have used both to get Hadoop jobs up and running (although my EMR
> > > > use has mostly been limited to running batch Pig scripts weekly).
> > > > Deciding on which one to use really depends on what kind of job/data
> > > > you're working with.
> > > >
> > > > EMR is most useful if you're already storing the dataset you're using
> > > > on S3 and plan on running a one-off job. My understanding is that it's
> > > > configured to use jets3t to stream data from S3 rather than copying it
> > > > to the cluster, which is fine for a single pass over a small to
> > > > medium-sized dataset, but obviously slower for multiple passes or
> > > > larger datasets. The API is also useful if you have a set workflow
> > > > that you plan to run on a regular basis, and I often prototype quick
> > > > and dirty jobs on very small EMR clusters to test how some things run
> > > > in the wild (obviously not the most cost-effective solution, but I've
> > > > found pseudo-distributed mode doesn't catch everything).
> > > > CDH gives you greater control over the initial setup and
> > > > configuration of your cluster. From my understanding, it's not really
> > > > an AMI. Rather, it's a set of Python scripts modified from the EC2
> > > > scripts in hadoop/contrib, with some nifty additions like being able
> > > > to specify and set up EBS volumes, proxy on the cluster, and some
> > > > others. The scripts use the boto Python module (a very useful Python
> > > > module for working with EC2) to ask EC2 to set up a cluster of the
> > > > specified size from whatever vanilla AMI is specified. They set up
> > > > the security groups, open the relevant ports, and then pass the init
> > > > script to each of the instances once they've booted (the same
> > > > user-data file setup, which is limited to 16K, I believe). The init
> > > > script tells each node to download Hadoop (from Cloudera's
> > > > OS-specific repos) and any other user-specified packages and set them
> > > > up. The Hadoop config XML is hardcoded into the init script (although
> > > > you can pass a modified config beforehand). The master is started
> > > > first, and then the slaves, so that the slaves can be told which
> > > > NameNode and JobTracker to connect to (the config uses the public
> > > > DNS, I believe, to make things easier to set up). You can use either
> > > > 0.18.3 (CDH) or 0.20 (CDH2) when it comes to Hadoop versions,
> > > > although I've had mixed results with the latter.
> > > >
> > > > Personally, I'd still like some kind of facade or something similar
> > > > to further abstract things and make it easier for others to quickly
> > > > set up ad-hoc clusters for 'quick n dirty' jobs.
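[Editor's sketch] The launch pattern zaki describes (boot a vanilla AMI, pass an init script to each node via user-data, mind the ~16K user-data cap) can be sketched with boto-2-style calls. Everything below is illustrative: the AMI ID, key name, instance type, and package names are made-up placeholders, not the actual Cloudera scripts.

```python
# Sketch of the launch flow described above: build a node init script,
# verify it fits EC2's user-data size limit, then request instances.
# All identifiers here are hypothetical, not Cloudera's real scripts.

USER_DATA_LIMIT = 16 * 1024  # EC2 caps user-data at 16 KB

def build_init_script(packages):
    """Build a shell init script like the one passed to each node."""
    lines = ["#!/bin/bash",
             "# install Hadoop from Cloudera's repo, plus extras"]
    lines += ["apt-get -y install %s" % p for p in packages]
    return "\n".join(lines) + "\n"

def check_user_data(script):
    """Return the script's byte size; raise if over the 16K cap."""
    size = len(script.encode("utf-8"))
    if size > USER_DATA_LIMIT:
        raise ValueError("user-data is %d bytes, over the 16K cap" % size)
    return size

def launch_cluster(conn, ami_id, n_slaves, key_name, script):
    """Request 1 master + n slaves from a vanilla AMI (boto 2 style).
    `conn` is a boto.ec2 connection object; not exercised here."""
    check_user_data(script)
    return conn.run_instances(ami_id,
                              min_count=1 + n_slaves,
                              max_count=1 + n_slaves,
                              key_name=key_name,
                              user_data=script,
                              instance_type="m1.large")

script = build_init_script(["hadoop-0.20", "rsync"])
print(check_user_data(script))
```

The size check matters because the whole per-node setup has to ride in that one user-data payload, which is why the real scripts keep the config terse and fetch the heavy bits from package repos instead.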
> > > > I know other libraries like Crane have been released recently, but
> > > > given the language of choice (Clojure), I haven't yet had a chance
> > > > to really investigate.
> > > >
> > > > On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
> > > >
> > > >> Just use several of these files.
> > > >>
> > > >> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]> wrote:
> > > >>
> > > >>> EMR requires an S3 bucket, but S3 has a limit on object size
> > > >>> (5 GB), so some extra care is needed here. Has anyone encountered
> > > >>> the file size problem on S3 as well? I kind of think that it's
> > > >>> unreasonable to have a 5 GB size limit when we want to use the
> > > >>> system to deal with large data sets.
> > > >>
> > > >> --
> > > >> Ted Dunning, CTO
> > > >> DeepDyve
> > > >
> > > > --
> > > > Zaki Rahaman
> > >
> > > --------------------------
> > > Grant Ingersoll
> > > http://www.lucidimagination.com/
> > >
> > > Search the Lucene ecosystem using Solr/Lucene:
> > > http://www.lucidimagination.com/search
>
> --
> Chenmin Liang
> Language Technologies Institute, School of Computer Science
> Carnegie Mellon University
