Hi all, I was wondering if any of you have had a similar experience working with Hadoop in Amazon's environment. I've been running a few jobs over the last few months and have noticed them taking more and more time. For instance, I've been running teragen/terasort/teravalidate as a benchmark, and the average execution times of all three jobs have increased by 25-33% this month compared to what I was seeing in December.

Once I had quantified this, I started collecting disk I/O stats using sar and dd. I found that on any given node in an EMR cluster, the throughput to the ephemeral storage ranged from <30 MB/s to >400 MB/s. With EBS volumes, the throughput ranged from ~20 MB/s up to 100 MB/s. Since these jobs are I/O bound, I have to assume that these huge swings in disk speed are what's causing my jobs to take longer. Unfortunately I wasn't collecting the sar/dd data in December, so I don't have anything to compare it to.
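For reference, this is roughly how I'm sampling throughput on each node (the paths, block sizes, and durations below are just illustrative values, not the exact ones from my tests):

    # Sequential write test against one ephemeral/EBS mount, bypassing the page cache
    dd if=/dev/zero of=/mnt/dd_test bs=1M count=1024 oflag=direct conv=fsync

    # Per-device throughput and utilization, sampled every second for a minute
    sar -d 1 60

    # Remove the test file afterwards
    rm -f /mnt/dd_test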
I'm just wondering whether others have done these kinds of performance benchmarks, and how you went about tuning Hadoop, or tuning how you run your jobs, to mitigate the effects. If these were small variations in performance I wouldn't be too concerned, but in any given test I can have one drive running >20x faster or slower than another.
