Hi Steve,

> -are you asking for XL or bigger VMs to get the full physical host and
> less network throttling?
I've used m1.large, m1.xlarge and cc1.4xlarge instance types and seen this
issue on all of them.  Speaking specifically about cc1.4xlarge instances, I
see disk read speeds for ephemeral storage vary from ~800MB/s down to
~200MB/s between nodes.  For writes to the ephemeral storage it ranges
anywhere from ~50MB/s down to ~15MB/s.  Writes to EBS are equally
inconsistent regardless of instance type.  When I read and write
concurrently to both ephemeral drives, throughput falls to an aggregate of
~50MB/s for write and ~40MB/s for read.  As for the benchmarking, I'm
running simple dd commands like these:

sudo dd of=/dev/xvdb if=/dev/zero bs=1M count=8192 oflag=direct
sudo dd if=/dev/xvdc of=/dev/null bs=1M count=8192 iflag=direct

Then I send a "kill -s USR1" to get the dd processes to dump stats.  I run
the dd commands one at a time for each ephemeral disk/EBS volume, and then
I run reads/writes against every disk concurrently to get an aggregate
performance level.
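
In case it's useful, here's roughly what one of the concurrent runs looks
like as a small script.  The device names are just from my cc1.4xlarge
setup, and the sleep/pkill part is only a sketch of how I poke the running
dd processes for their stats:

# Kick off a write test and a read test against the two ephemeral disks
# (assumes they show up as /dev/xvdb and /dev/xvdc; adjust for your setup).
sudo dd of=/dev/xvdb if=/dev/zero bs=1M count=8192 oflag=direct &
sudo dd if=/dev/xvdc of=/dev/null bs=1M count=8192 iflag=direct &

# Ask the running dd processes to dump their current throughput stats.
sleep 30
sudo pkill -USR1 -x dd

# Wait for both transfers to finish; dd prints final stats on exit.
wait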


> -does it behave differently if you bring up clusters on different sites?
I've tested in multiple availability zones in the us-west and us-east
regions and the experience has been the same. For cc1.4xlarge instances I've
only tested in us-east.

On Tue, Feb 1, 2011 at 7:48 AM, Steve Loughran <[email protected]> wrote:

> On 31/01/11 23:22, Aaron Eng wrote:
>
>> Hi all,
>>
>> I was wondering if any of you have had a similar experience working
>> with Hadoop in Amazon's environment.  I've been running a few jobs
>> over the last few months and have noticed them taking more and more
>> time.  For instance, I was running teragen/terasort/teravalidate as a
>> benchmark and I've noticed the average execution times of all three
>> jobs have increased by 25-33% this month vs. what I was seeing in
>> December.  When I was able to quantify this I started collecting some
>> disk IO stats using SAR and dd.  I found that on any given node in an
>> EMR cluster, the throughput to the ephemeral storage ranged from
>> <30MB/s to >400MB/s.  I also noticed that when using EBS volumes, the
>> throughput would range from ~20MB/s up to 100MB/s.  Since those jobs
>> are I/O bound I would have to assume that these huge swings in speed
>> are causing my jobs to take longer.  Unfortunately I wasn't collecting
>> the SAR/dd info in December so I don't have anything to compare it to.
>>
>
> -are you asking for XL or bigger VMs to get the full physical host and less
> network throttling?
>
> -does it behave differently if you bring up clusters on different sites?
>
>
>> Just wondering if others have done these types of performance
>> benchmarks and how they went about tuning Hadoop or tuning how you run
>> your jobs to mitigate the effects.  If these were small variations in
>> performance I wouldn't be too concerned.  But in any given test, I can
>> have a drive running >20x faster/slower than another drive.
>>
>>
>
