We don't have hard numbers on S3 transfer rates. The cluster-wide transfer rate depends on a number of factors such as instance type, cluster size, and general network congestion.
I'm curious why you think S3 won't work for your use case. Would you like to
elaborate? As I described in the previous e-mail, you can use HDFS for
intermediate data processing; S3 only needs to be used at the beginning and
end of your job flow, to saturate the cluster with data and to persist it when
you are done.

Andrew

On 7/21/09 6:12 AM, "Larry Compton" <[email protected]> wrote:

Andrew,

Thanks for the information. Can you give me some numbers on transfer rates
from S3 into HDFS? Processing the content in place in S3 isn't an option for
us.

Larry

On Fri, Jul 17, 2009 at 5:57 PM, Hitchcock, Andrew <[email protected]> wrote:
> Hi Larry,
>
> I'm an engineer with Elastic MapReduce. The latency from your EC2 cluster to
> S3 is certainly higher than within your cluster using HDFS. However, there
> are ways to mitigate the latency. Of course, the best way to know if EMR
> works with your use case is to give it a try.
>
> We recommend using the S3 native file system (S3N) with Elastic MapReduce,
> which reads files from S3 in their native format. The standard workflow for
> an EMR job flow is to create a step that reads from S3 and does the first
> round of processing. You can then run any number of processing steps on the
> data, using HDFS as the location for your intermediate data. When you are
> done, you can have the last step specify S3N as its output location.
> However, we recommend storing the output of the last step in HDFS and then
> creating a Distcp step to copy it to S3 in bulk.
>
> If you want to perform multiple computations on the original data stored in
> S3, you can Distcp the data down to HDFS on your cluster to avoid reading it
> from S3 multiple times.
>
> Here are the URI schemes you would use with EMR:
>
> S3: s3n://<bucket>/<directory>
> HDFS: hdfs:///<directory>
>
> The data in S3 can be read or written with any standard tool, such as S3
> Organizer or s3cmd.
>
> Also, in the future, the best place for Elastic MapReduce-specific questions
> is our developer support forum:
>
> http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52
>
> Let me know if this answers your questions.
>
> Best regards,
> Andrew Hitchcock
>
> ------------------------
> I have a question about how Amazon Elastic MapReduce handles persistent
> content stored in S3. I'm interested in using AEMR, but I'm concerned about
> latency introduced by copying content from S3 into HDFS. With AEMR, is the
> S3 storage actually an HDFS file system, or does HDFS have to be repopulated
> every time you reinstantiate your Hadoop EC2 nodes?
>
> Larry Compton
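[Editor's note] The workflow Andrew describes above (bulk-copy input down from S3, use HDFS for intermediate data, bulk-copy results back to S3) can be sketched as a pair of Distcp steps. This is only an illustration: the bucket and path names are placeholders, not anything from this thread, and the commands assume a running Hadoop/EMR cluster.

```shell
# Bulk-copy the input data from S3 (via the S3N scheme) into the
# cluster's HDFS, so repeated jobs don't re-read it from S3.
hadoop distcp s3n://my-bucket/input hdfs:///input

# ...run your MapReduce steps here, reading hdfs:///input and writing
# intermediate and final data to HDFS, e.g. hdfs:///output...

# Bulk-copy the final results from HDFS back up to S3 for persistence.
hadoop distcp hdfs:///output s3n://my-bucket/output

# The persisted data can then be inspected with a standard S3 client
# such as s3cmd, as the thread mentions:
s3cmd ls s3://my-bucket/output/
```

Running these as explicit Distcp steps at the start and end of a job flow keeps per-record S3 latency out of the processing steps themselves, which is the mitigation Andrew recommends.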
