Hi Larry, I'm an engineer on the Elastic MapReduce team. The latency from your EC2 cluster to S3 is certainly higher than what you see within the cluster using HDFS. However, there are ways to mitigate that latency. Of course, the best way to know whether EMR works for your use case is to give it a try.
We recommend using the S3 native file system (S3N) with Elastic MapReduce, which reads files from S3 in their native format. The standard workflow for an EMR job flow is to create a step that reads from S3 and does the first round of processing. You can then run any number of processing steps on the data, using HDFS as the location for your intermediate data. When you are done, the last step can specify S3N as its output location. However, we recommend storing the output of the last step in HDFS and then adding a Distcp step to copy it to S3 in bulk. If you want to perform multiple computations on the original data stored in S3, you can Distcp the data down to the HDFS on your cluster to avoid reading it from S3 multiple times.

Here are the URI schemes you would use with EMR:

    S3:   s3n://<bucket>/<directory>
    HDFS: hdfs:///<directory>

The data in S3 can be read or written with any standard tool such as S3 Organizer or s3cmd.

Also, in the future, the best place for Elastic MapReduce specific questions is our developer support forum:
http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52

Let me know if this answers your questions.

Best regards,
Andrew Hitchcock

------------------------

I have a question about how Amazon Elastic MapReduce handles persistent content stored in S3. I'm interested in using AEMR, but I'm concerned about the latency introduced by copying content from S3 into HDFS. With AEMR, is the S3 storage actually an HDFS file system, or does HDFS have to be repopulated every time you reinstantiate your Hadoop EC2 nodes?

Larry Compton
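
To make the workflow above concrete, here is a minimal Hadoop driver sketch (using the newer org.apache.hadoop.mapreduce API) for the first step of such a job flow: it reads the persistent input directly from S3 through s3n:// and writes its intermediate output to hdfs:///. The bucket name, paths, and the identity mapper/reducer are illustrative placeholders, not part of the exchange above; the Distcp commands in the comments show what the bulk copy back to S3 would look like.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FirstStep {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "first-round-processing");
            job.setJarByClass(FirstStep.class);

            // The default (identity) mapper and reducer are used here only to
            // keep the sketch short; a real step would set its own classes.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            // Read the persistent input directly from S3 via S3N ...
            FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input"));
            // ... and write intermediate data to the cluster's local HDFS.
            FileOutputFormat.setOutputPath(job, new Path("hdfs:///intermediate/step1"));

            // Later steps read hdfs:///intermediate/... The final step can either
            // write to s3n:// directly, or write to HDFS and let a Distcp step
            // copy the result to S3 in bulk, e.g.:
            //   hadoop distcp hdfs:///final-output s3n://my-bucket/output
            // To reuse the original S3 data across several steps, copy it down first:
            //   hadoop distcp s3n://my-bucket/input hdfs:///cached-input
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Keeping intermediate data on HDFS means only the first read and the final write touch S3, which is the main way the S3 latency gets mitigated in this layout.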
