Hi Larry, I'm an engineer on the Elastic MapReduce team. The latency from your EC2 cluster to S3 is certainly higher than what you see within the cluster using HDFS. However, there are ways to mitigate that latency. Of course, the best way to know whether EMR works for your use case is to give it a try.
We recommend using the S3 native file system (S3N) with Elastic MapReduce, which reads files from S3 in their native format. The standard workflow for an EMR job flow is to create a step that reads from S3 and does the first round of processing. You can then run any number of processing steps on the data, using HDFS as the location for your intermediate data. When you are done, the last step can specify S3N as its output location. However, we recommend storing the output of the last step in HDFS and then adding a Distcp step to copy it to S3 in bulk. If you want to perform multiple computations on the original data stored in S3, you can Distcp the data down to the HDFS on your cluster to avoid reading it from S3 multiple times.

Here are the URI schemes you would use with EMR:

    S3:   s3n://<bucket>/<directory>
    HDFS: hdfs:///<directory>

The data in S3 can be read or written with any standard tool such as S3 Organizer or s3cmd.

Also, in the future, the best place for Elastic MapReduce specific questions is our developer support forum:
http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52

Let me know if this answers your questions.

Best regards,
Andrew Hitchcock

------------------------

I have a question about how Amazon Elastic MapReduce handles persistent content stored in S3. I'm interested in using AEMR, but I'm concerned about the latency introduced by copying content from S3 into HDFS. With AEMR, is the S3 storage actually an HDFS file system, or does HDFS have to be repopulated every time you reinstantiate your Hadoop EC2 nodes?

Larry Compton
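
To make the workflow above concrete, here is a minimal Hadoop driver sketch (using the newer org.apache.hadoop.mapreduce API) for the first step of such a job flow: it reads the persistent input directly from S3 through s3n:// and writes its intermediate output to hdfs:///. The bucket name, paths, and the identity mapper/reducer are illustrative placeholders, not part of the exchange above; the Distcp commands in the comments show what the bulk copy back to S3 would look like.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FirstStep {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "first-round-processing");
            job.setJarByClass(FirstStep.class);

            // The default (identity) mapper and reducer are used here only to
            // keep the sketch short; a real step would set its own classes.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Read the persistent input directly from S3 via S3N ...
            FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input"));
            // ... and write intermediate data to the cluster's local HDFS.
            FileOutputFormat.setOutputPath(job, new Path("hdfs:///intermediate/step1"));

            // Later steps read hdfs:///intermediate/... The final step can either
            // write to s3n:// directly, or write to HDFS and let a Distcp step
            // copy the result to S3 in bulk, e.g.:
            //   hadoop distcp hdfs:///final-output s3n://my-bucket/output
            // To reuse the original S3 data across several steps, copy it down first:
            //   hadoop distcp s3n://my-bucket/input hdfs:///cached-input
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Keeping intermediate data on HDFS means only the first read and the final write touch S3, which is the main way the S3 latency gets mitigated in this layout.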
