I second Todd's recommendation. Elastic MapReduce currently doesn't have a mechanism for users to change mapred.tasktracker.map.tasks.maximum. However, by default we run more mappers per core than is generally recommended, because we've found it results in better performance in the EC2/S3 environment.
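For reference, on a self-managed Hadoop cluster (where this knob is not locked down as it is on EMR), the per-TaskTracker map slot count is set in conf/mapred-site.xml; the value of 4 below is purely illustrative:

```xml
<!-- conf/mapred-site.xml: maximum number of map tasks run
     concurrently by each TaskTracker (illustrative value) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
```

The TaskTracker reads this at startup, so changing it requires restarting the TaskTracker daemons.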
Andrew

On 7/22/09 4:45 PM, "Todd Lipcon" <[email protected]> wrote:

On Wed, Jul 22, 2009 at 4:20 PM, Hitchcock, Andrew <[email protected]> wrote:
> We don't have hard numbers on S3 transfer rates. The cluster-wide transfer
> rate depends on a number of factors such as instance type, cluster size, and
> general network congestion.

Your mileage will of course vary based on the factors Andrew mentioned, but in
practice I generally see 1-2 MB per second per reader for large files off S3
using the s3n filesystem. So, if you are not CPU bound, I recommend using a
higher-than-usual number of map slots in order to pull enough throughput out
of your S3 bucket.

-Todd

> On 7/21/09 6:12 AM, "Larry Compton" <[email protected]> wrote:
>
> Andrew,
>
> Thanks for the information. Can you give me some numbers on transfer
> rates from S3 into HDFS? Processing the content in place in S3 isn't
> an option for us.
>
> Larry
>
> On Fri, Jul 17, 2009 at 5:57 PM, Hitchcock, Andrew <[email protected]> wrote:
> > Hi Larry,
> >
> > I'm an engineer with Elastic MapReduce. The latency from your EC2 cluster
> > to S3 is certainly higher than within your cluster using HDFS. However,
> > there are ways to mitigate the latency. Of course, the best way to know if
> > EMR works with your use case is to give it a try.
> >
> > We recommend using the S3 native file system (S3N) with Elastic
> > MapReduce, which reads the files from S3 in their native format. The
> > standard workflow for an EMR job flow is to create a step that reads
> > from S3 and does the first round of processing. Then you can run any
> > number of processing steps on the data, using HDFS as the location of
> > your intermediate data. When you are done, you can have the last step
> > specify S3N as its output location. However, we recommend storing the
> > output of the last step in HDFS and then adding a distcp step to copy
> > it to S3 in bulk.
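The job-flow shape described above (read from S3 via s3n, keep intermediate data in HDFS, bulk-copy the final output with distcp) might be sketched as follows; the bucket name, JAR, and class/path names are hypothetical:

```shell
# Step 1: first round of processing reads directly from S3 via s3n
hadoop jar my-job.jar FirstPass \
  s3n://my-bucket/input/ hdfs:///intermediate/pass1/

# Steps 2..N: further processing entirely within HDFS
hadoop jar my-job.jar SecondPass \
  hdfs:///intermediate/pass1/ hdfs:///output/final/

# Last step: bulk-copy the final output up to S3 with distcp
hadoop distcp hdfs:///output/final/ s3n://my-bucket/output/
```

These commands assume a running Hadoop cluster with S3 credentials configured for the s3n filesystem.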
> > If you want to perform multiple computations on the original data stored
> > in S3, you can distcp the data down to HDFS on your cluster to avoid
> > reading it from S3 multiple times.
> >
> > Here are the URI schemes you would use with EMR:
> >
> > S3: s3n://<bucket>/<directory>
> > HDFS: hdfs:///<directory>
> >
> > The data in S3 can be read or written with any standard tool, such as
> > S3 Organizer or s3cmd.
> >
> > Also, in the future, the best place for Elastic MapReduce-specific
> > questions is our developer support forum:
> >
> > http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52
> >
> > Let me know if this answers your questions.
> >
> > Best regards,
> > Andrew Hitchcock
> >
> > ------------------------
> > I have a question about how Amazon Elastic MapReduce handles
> > persistent content stored in S3. I'm interested in using AEMR, but I'm
> > concerned about latency introduced by copying content from S3 into
> > HDFS. With AEMR, is the S3 storage actually an HDFS file system, or
> > does HDFS have to be repopulated every time you reinstantiate your
> > Hadoop EC2 nodes?
> >
> > Larry Compton
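Andrew's suggestion of copying shared input down to HDFS once, using the s3n:// and hdfs:/// URI schemes he lists, might look like the following sketch; the bucket name, JAR, and job names are hypothetical:

```shell
# Copy the source data from S3 to the cluster's HDFS once...
hadoop distcp s3n://my-bucket/source-data/ hdfs:///source-data/

# ...then each subsequent job reads the local HDFS copy instead of S3
hadoop jar analysis.jar JobA hdfs:///source-data/ hdfs:///results/job-a/
hadoop jar analysis.jar JobB hdfs:///source-data/ hdfs:///results/job-b/
```

The distcp runs as a MapReduce job itself, so the copy is parallelized across the cluster rather than funneled through a single node.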
