We don't have hard numbers on S3 transfer rates. The cluster-wide transfer rate depends on a number of factors such as instance type, cluster size, and general network congestion.
I'm curious why you think S3 won't work for your use case. Would you like to
elaborate? As I described in the previous e-mail, you can use HDFS for
intermediate data processing; S3 only needs to be used at the beginning and
end of your job flow, to saturate the cluster with data and to persist it when
you are done.

Andrew

On 7/21/09 6:12 AM, "Larry Compton" <[email protected]> wrote:

Andrew,

Thanks for the information. Can you give me some numbers on transfer rates
from S3 into HDFS? Processing the content in place in S3 isn't an option for
us.

Larry

On Fri, Jul 17, 2009 at 5:57 PM, Hitchcock, Andrew <[email protected]> wrote:
> Hi Larry,
>
> I'm an engineer with Elastic MapReduce. The latency from your EC2 cluster to
> S3 is certainly higher than within your cluster using HDFS. However, there
> are ways to mitigate the latency. Of course, the best way to know if EMR
> works with your use case is to give it a try.
>
> We recommend using the S3 native file system (S3N) with Elastic MapReduce,
> which reads files from S3 in their native format. The standard workflow for
> an EMR job flow is to create a step that reads from S3 and does the first
> round of processing. You can then run any number of processing steps on the
> data, using HDFS as the location for your intermediate data. When you are
> done, you can have the last step specify S3N as its output location.
> However, we recommend storing the output of the last step in HDFS and then
> creating a Distcp step to copy it to S3 in bulk.
>
> If you want to perform multiple computations on the original data stored in
> S3, you can Distcp the data down to HDFS on your cluster to avoid reading it
> from S3 multiple times.
>
> Here are the URI schemes you would use with EMR:
>
> S3: s3n://<bucket>/<directory>
> HDFS: hdfs:///<directory>
>
> The data in S3 can be read or written with any standard tool, such as S3
> Organizer or s3cmd.
>
> Also, in the future, the best place for Elastic MapReduce-specific questions
> is our developer support forum:
>
> http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52
>
> Let me know if this answers your questions.
>
> Best regards,
> Andrew Hitchcock
>
> ------------------------
> I have a question about how Amazon Elastic MapReduce handles persistent
> content stored in S3. I'm interested in using AEMR, but I'm concerned about
> latency introduced by copying content from S3 into HDFS. With AEMR, is the
> S3 storage actually an HDFS file system, or does HDFS have to be repopulated
> every time you reinstantiate your Hadoop EC2 nodes?
>
> Larry Compton
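[Editor's note] The workflow Andrew describes above (bulk-copy input down from S3, use HDFS for intermediate data, bulk-copy results back to S3) can be sketched as a pair of Distcp steps. This is only an illustration: the bucket and path names are placeholders, not anything from this thread, and the commands assume a running Hadoop/EMR cluster.

```shell
# Bulk-copy the input data from S3 (via the S3N scheme) into the
# cluster's HDFS, so repeated jobs don't re-read it from S3.
hadoop distcp s3n://my-bucket/input hdfs:///input

# ...run your MapReduce steps here, reading hdfs:///input and writing
# intermediate and final data to HDFS, e.g. hdfs:///output...

# Bulk-copy the final results from HDFS back up to S3 for persistence.
hadoop distcp hdfs:///output s3n://my-bucket/output

# The persisted data can then be inspected with a standard S3 client
# such as s3cmd, as the thread mentions:
s3cmd ls s3://my-bucket/output/
```

Running these as explicit Distcp steps at the start and end of a job flow keeps per-record S3 latency out of the processing steps themselves, which is the mitigation Andrew recommends.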
