On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:

Hi Tim,

Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance. I'm noticing that it takes
about 16 minutes to transfer about 15GB of textual uncompressed data
from S3 into HDFS after the cluster has started with 15 nodes. I was
expecting this to take a shorter amount of time, but maybe I'm
incorrect in my assumptions. I am also noticing that it takes about 15
minutes to parse through the 15GB of data with a 15 node cluster.

I'm seeing much faster speeds. With 128 nodes running a mapper-only download job, pulling 30 GB takes roughly a minute, less time than the end-of-job work (which I assume is HDFS replication and bookkeeping). More mappers give you more parallel downloads, of course. I'm using a Python REST client for S3, and I only move data to or from S3 once Hadoop is done with it.
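For concreteness, here's a rough sketch of the kind of S3 REST access I mean: a query-string-authenticated GET URL (AWS Signature Version 2, HMAC-SHA1), which each mapper can fetch with plain HTTP. This is just an illustrative example, not my actual client; the bucket, key, and credential names are made up.

```python
# Sketch: build a pre-signed S3 GET URL using query-string auth (Signature V2).
# All names (bucket, key, credentials) below are hypothetical placeholders.
import base64
import hashlib
import hmac
import urllib.parse

def signed_s3_url(bucket, key, access_key, secret_key, expires):
    """Return a time-limited S3 GET URL signed with HMAC-SHA1 (Signature V2)."""
    # S3 signs the canonical string "GET\n\n\n<expires>\n/<bucket>/<key>".
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    # Base64-encode the 20-byte digest, then URL-escape it for the query string.
    signature = urllib.parse.quote_plus(base64.b64encode(digest).decode())
    return ("https://%s.s3.amazonaws.com/%s"
            "?AWSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (bucket, key, access_key, expires, signature))

# Each mapper can then fetch its assigned keys with any HTTP library,
# e.g. urllib.request.urlopen(signed_s3_url(...)).
```

Generating the signed URL on the driver and handing keys to mappers keeps the secret key out of the job configuration, which is one reason to prefer this over shipping full credentials to every node.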

Make sure your S3 buckets and EC2 instances are in the same region, or you'll pay for the cross-region transfer and see higher latency.
