I second Todd's recommendation. Elastic MapReduce currently doesn't have a mechanism for users to change mapred.tasktracker.map.tasks.maximum. However, by default we run more mappers per core than is generally recommended, because we've found it results in better performance in the EC2/S3 environment.
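For reference, on a self-managed Hadoop cluster (where this knob is not locked down as it is on EMR), the per-TaskTracker map slot count is set in conf/mapred-site.xml; the value of 4 below is purely illustrative:

```xml
<!-- conf/mapred-site.xml: maximum number of map tasks run
     concurrently by each TaskTracker (illustrative value) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
```

The TaskTracker reads this at startup, so changing it requires restarting the TaskTracker daemons.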
Andrew

On 7/22/09 4:45 PM, "Todd Lipcon" <[email protected]> wrote:

On Wed, Jul 22, 2009 at 4:20 PM, Hitchcock, Andrew <[email protected]> wrote:
> We don't have hard numbers on S3 transfer rates. The cluster-wide transfer
> rate depends on a number of factors such as instance type, cluster size, and
> general network congestion.

Your mileage will of course vary based on the factors Andrew mentioned, but in
practice I generally see 1-2 MB per second per reader for large files off S3
using the s3n filesystem. So, if you are not CPU bound, I recommend using a
higher-than-usual number of map slots in order to pull enough throughput out
of your S3 bucket.

-Todd

> On 7/21/09 6:12 AM, "Larry Compton" <[email protected]> wrote:
>
> Andrew,
>
> Thanks for the information. Can you give me some numbers on transfer
> rates from S3 into HDFS? Processing the content in place in S3 isn't
> an option for us.
>
> Larry
>
> On Fri, Jul 17, 2009 at 5:57 PM, Hitchcock, Andrew <[email protected]> wrote:
> > Hi Larry,
> >
> > I'm an engineer with Elastic MapReduce. The latency from your EC2 cluster
> > to S3 is certainly higher than within your cluster using HDFS. However,
> > there are ways to mitigate the latency. Of course, the best way to know if
> > EMR works with your use case is to give it a try.
> >
> > We recommend using the S3 native file system (S3N) with Elastic
> > MapReduce, which reads the files from S3 in their native format. The
> > standard workflow for an EMR job flow is to create a step that reads
> > from S3 and does the first round of processing. Then you can run any
> > number of processing steps on the data, using HDFS as the location of
> > your intermediate data. When you are done, you can have the last step
> > specify S3N as its output location. However, we recommend storing the
> > output of the last step in HDFS and then adding a distcp step to copy
> > it to S3 in bulk.
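The job-flow shape described above (read from S3 via s3n, keep intermediate data in HDFS, bulk-copy the final output with distcp) might be sketched as follows; the bucket name, JAR, and class/path names are hypothetical:

```shell
# Step 1: first round of processing reads directly from S3 via s3n
hadoop jar my-job.jar FirstPass \
  s3n://my-bucket/input/ hdfs:///intermediate/pass1/

# Steps 2..N: further processing entirely within HDFS
hadoop jar my-job.jar SecondPass \
  hdfs:///intermediate/pass1/ hdfs:///output/final/

# Last step: bulk-copy the final output up to S3 with distcp
hadoop distcp hdfs:///output/final/ s3n://my-bucket/output/
```

These commands assume a running Hadoop cluster with S3 credentials configured for the s3n filesystem.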
> > If you want to perform multiple computations on the original data stored
> > in S3, you can distcp the data down to HDFS on your cluster to avoid
> > reading it from S3 multiple times.
> >
> > Here are the URI schemes you would use with EMR:
> >
> > S3: s3n://<bucket>/<directory>
> > HDFS: hdfs:///<directory>
> >
> > The data in S3 can be read or written with any standard tool, such as
> > S3 Organizer or s3cmd.
> >
> > Also, in the future, the best place for Elastic MapReduce-specific
> > questions is our developer support forum:
> >
> > http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52
> >
> > Let me know if this answers your questions.
> >
> > Best regards,
> > Andrew Hitchcock
> >
> > ------------------------
> > I have a question about how Amazon Elastic MapReduce handles
> > persistent content stored in S3. I'm interested in using AEMR, but I'm
> > concerned about latency introduced by copying content from S3 into
> > HDFS. With AEMR, is the S3 storage actually an HDFS file system, or
> > does HDFS have to be repopulated every time you reinstantiate your
> > Hadoop EC2 nodes?
> >
> > Larry Compton
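Andrew's suggestion of copying shared input down to HDFS once, using the s3n:// and hdfs:/// URI schemes he lists, might look like the following sketch; the bucket name, JAR, and job names are hypothetical:

```shell
# Copy the source data from S3 to the cluster's HDFS once...
hadoop distcp s3n://my-bucket/source-data/ hdfs:///source-data/

# ...then each subsequent job reads the local HDFS copy instead of S3
hadoop jar analysis.jar JobA hdfs:///source-data/ hdfs:///results/job-a/
hadoop jar analysis.jar JobB hdfs:///source-data/ hdfs:///results/job-b/
```

The distcp runs as a MapReduce job itself, so the copy is parallelized across the cluster rather than funneled through a single node.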
