Hi all, I was wondering if any of you have had a similar experience working with Hadoop in Amazon's environment. I've been running a few jobs over the last few months and have noticed them taking more and more time. For instance, I've been running teragen/terasort/teravalidate as a benchmark, and the average execution times of all three jobs have increased by 25-33% this month compared to what I was seeing in December.

Once I had quantified this, I started collecting disk I/O stats using sar and dd. I found that on any given node in an EMR cluster, the throughput to the ephemeral storage ranged from <30 MB/s to >400 MB/s. With EBS volumes, the throughput ranged from ~20 MB/s up to 100 MB/s. Since these jobs are I/O bound, I have to assume that these huge swings in disk speed are what's causing my jobs to take longer. Unfortunately I wasn't collecting the sar/dd data in December, so I don't have anything to compare it to.
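For reference, this is roughly how I'm sampling throughput on each node (the paths, block sizes, and durations below are just illustrative values, not the exact ones from my tests):

    # Sequential write test against one ephemeral/EBS mount, bypassing the page cache
    dd if=/dev/zero of=/mnt/dd_test bs=1M count=1024 oflag=direct conv=fsync

    # Per-device throughput and utilization, sampled every second for a minute
    sar -d 1 60

    # Remove the test file afterwards
    rm -f /mnt/dd_test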
I'm just wondering whether others have done these kinds of performance benchmarks, and how you went about tuning Hadoop, or tuning how you run your jobs, to mitigate the effects. If these were small variations in performance I wouldn't be too concerned, but in any given test I can have one drive running >20x faster or slower than another.
