How can you ensure that the S3 buckets and EC2 instances are in the same zone?
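For context: an S3 bucket's location is fixed when the bucket is created, and an EC2 instance's availability zone is chosen at launch, so both have to be pinned explicitly. Below is a minimal sketch of the idea using the modern AWS SDK for Java (which postdates this thread and is shown only to illustrate the principle; in 2008 the same choices were made via the S3 location constraint and the EC2 launch parameters). The bucket name, AMI ID, and zone are hypothetical.

import com.amazonaws.regions.Regions;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.Placement;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class SameRegionSetup {
    public static void main(String[] args) {
        // S3 buckets are regional: creating the bucket through a
        // region-pinned client fixes its location at creation time.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion(Regions.US_EAST_1)
                .build();
        s3.createBucket("my-hadoop-logs"); // hypothetical bucket name

        // EC2 instances are placed per availability zone; pick a zone
        // inside the same region as the bucket.
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.standard()
                .withRegion(Regions.US_EAST_1)
                .build();
        RunInstancesRequest req = new RunInstancesRequest()
                .withImageId("ami-12345678")   // hypothetical AMI ID
                .withInstanceType("m1.large")
                .withMinCount(1)
                .withMaxCount(1)
                .withPlacement(new Placement("us-east-1a")); // zone in the bucket's region
        ec2.runInstances(req);
    }
}

Keeping both in the same region avoids the inter-region bandwidth charges and the extra latency that would otherwise dominate a bulk S3-to-HDFS copy.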
Ryan

On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>
> On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:
>
>> Hi Tim,
>>
>> Are you mostly just processing/parsing textual log files? How many
>> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
>> many did you configure in your JobConf? Just trying to get an idea of
>> what to expect in terms of performance. I'm noticing that it takes
>> about 16 minutes to transfer about 15GB of textual uncompressed data
>> from S3 into HDFS after the cluster has started with 15 nodes. I was
>> expecting this to take a shorter amount of time, but maybe I'm
>> incorrect in my assumptions. I am also noticing that it takes about 15
>> minutes to parse through the 15GB of data with a 15 node cluster.
>
> I'm seeing much faster speeds. With 128 nodes running a mapper-only
> downloading job, downloading 30 GB takes roughly a minute, less time than
> the end-of-job work (which I assume is HDFS replication and bookkeeping).
> More mappers give you more parallel downloads, of course. I'm using a
> Python REST client for S3, and only move data to or from S3 when Hadoop is
> done with it.
>
> Make sure your S3 buckets and EC2 instances are in the same zone.
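Regarding the JobConf question upthread: with the org.apache.hadoop.mapred API of this era, the reduce count is honored as set, while the map count is only a hint to the framework (the actual number of maps follows the input splits). A minimal sketch, where the paths and task counts are assumptions for a 15-node cluster; a real job would also set its own Mapper and Reducer classes instead of the identity defaults used here:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class LogParseJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LogParseJob.class);
        conf.setJobName("parse-logs");

        // Hint only: the actual map count is driven by the input splits.
        conf.setNumMapTasks(2 * 15);  // ~2 maps per node (assumption)
        // Honored as set: one reduce per node keeps every node busy.
        conf.setNumReduceTasks(15);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // TextInputFormat yields LongWritable offsets and Text lines; the
        // identity mapper/reducer defaults pass them through unchanged.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path("hdfs:///logs/raw"));      // hypothetical path
        FileOutputFormat.setOutputPath(conf, new Path("hdfs:///logs/parsed"));  // hypothetical path

        JobClient.runJob(conf);
    }
}

For the S3-to-HDFS copy itself, something like `hadoop distcp s3n://my-hadoop-logs/raw hdfs:///logs/raw` (bucket and paths hypothetical) should parallelize the transfer across the cluster, much like Karl's mapper-only download job, rather than pulling the 15GB through a single stream.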
