Pros:
- Easier to build out and tear down clusters vs. using physical machines in a lab
- Easier to scale up and scale down a cluster as needed

Cons:
- Reliability. In my experience I've had machines die, machines fail to start up, network outages between Amazon instances, etc. These problems have occurred at a far higher rate than in any physical lab I have ever administered.
- Money. You get charged for problems with their system. Need to add storage space to a node? That means renting space from EBS, which you then need to spend time formatting to ext3 so you can use it with Hadoop. So every time you want to use storage, you're paying Amazon for the time it takes to format it, because you can't tell EBS that you want an ext3 volume.
- Visibility. Amazon loves to report on their website that all their services are working properly, when in reality they only report issues that are extremely major. Just yesterday they reported "increased latency" in their us-east-1 region. In reality, "increased latency" meant that more than 50% of my Amazon API calls were timing out, I could not create new instances, and for about 2 hours I could not destroy the instances I had already spun up. How's that for ya? Paying them for machines that they won't let me terminate...

This applies to both EMR and clusters you'd create yourself in EC2.

So if you're willing to put up with not having much control over or insight into the environment you're using, Amazon may be a good bet. But don't expect it to be all rainbows and daisies. You will run into problems at various points which you did not cause and cannot correct yourself; you'll have to wait for Amazon to get their environment functioning.

On Thu, Dec 9, 2010 at 8:17 AM, Mark <[email protected]> wrote:
> Does anyone have any thoughts/experiences on running Hadoop in AWS? What
> are some pros/cons?
>
> Are there any good AMI's out there for this?
>
> Thanks for any advice.
>
