On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:
Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performance.

I'm noticing that it takes about 16 minutes to transfer about 15GB of textual uncompressed data from S3 into HDFS after the cluster has started with 15 nodes. I was expecting this to take less time, but maybe my assumptions are incorrect. I am also noticing that it takes about 15 minutes to parse through the 15GB of data with a 15 node cluster.
I'm seeing much faster speeds. With 128 nodes running a mapper-only downloading job, downloading 30 GB takes roughly a minute, which is less time than the end-of-job work (which I assume is HDFS replication and bookkeeping). More mappers give you more parallel downloads, of course. I'm using a Python REST client for S3, and only move data to or from S3 when Hadoop is done with it.
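For anyone curious what a mapper-only download job looks like, here is a minimal Hadoop Streaming-style sketch in Python. Everything in it is illustrative: the bucket/key input format, the helper names, and the use of plain urllib instead of whatever REST client the poster actually used. Each mapper gets a slice of the key list on stdin, so running more mappers means more concurrent S3 GETs.

```python
#!/usr/bin/env python
# Hypothetical sketch of a map-only Hadoop Streaming job that pulls
# objects from S3 in parallel. Input lines are assumed to look like
# "bucket<TAB>key"; bucket and key names here are made up.
import sys
import urllib.request


def s3_rest_url(bucket, key):
    """Build the plain REST GET URL for an S3 object."""
    return "https://%s.s3.amazonaws.com/%s" % (bucket, key)


def run_mapper(lines, fetch=urllib.request.urlopen):
    """Download each object named on stdin and emit key<TAB>size.

    Hadoop launches many copies of this mapper at once, so each
    additional mapper is another parallel download stream.
    """
    for line in lines:
        bucket, key = line.rstrip("\n").split("\t")
        data = fetch(s3_rest_url(bucket, key)).read()
        # Emit what was moved so the job output doubles as a manifest.
        print("%s\t%d" % (key, len(data)))


if __name__ == "__main__":
    run_mapper(sys.stdin)
```

In practice you would write the fetched bytes into HDFS (or let the streaming job's output collector do it) rather than just reporting sizes, but the parallelism story is the same: the split of the key list across mappers is what drives throughput.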
Make sure your S3 buckets and EC2 instances are in the same zone.
