We recommend that people use Amazon S3 as the durable store when using Elastic MapReduce. We consider HDFS on Elastic MapReduce clusters to be transient.
With that said, you need some way to get your data into S3 from HDFS. We recommend storing the files directly in S3 (with S3N, the native file system) rather than using the S3 block file system. That presents two challenges:

1. Making sure all files on your cluster are smaller than 5 GB (the S3 single-object limit).
2. Uploading your files without the use of S3N, which wasn't introduced until Hadoop 0.18.

You'll probably want to write a DistCp-like job that reads the files from HDFS and uploads them to S3 (a sketch of one follows at the end of this thread). If necessary, it should also detect files larger than 5 GB and split them into multiple pieces.

Andrew

On 3/22/10 9:23 PM, "ilayaraja" <ilayar...@rediff.co.in> wrote:

Hi Andrew,

Yes, the data is on an EC2 cluster only.

Regards,
Ilay

----- Original Message -----
From: Hitchcock, Andrew <a...@amazon.com>
To: common-dev@hadoop.apache.org; ilayar...@rediff.co.in
Sent: Tuesday, March 23, 2010 1:57 AM
Subject: Re: Hadoop Compatibility and EMR

Hi,

At this time Elastic MapReduce only supports Hadoop 0.18.3. Is the cluster that stores the 10 TB of data currently running on Amazon EC2?

Regards,
Andrew

On Mar 21, 2010, at 12:23 AM, "ilayaraja" <ilayar...@rediff.co.in> wrote:

> Hi,
>
> We've been using Hadoop 0.15.5 in our production environment, where we
> have about 10 TB of data stored on the DFS. The files were generated as
> MapReduce output. We want to move our environment to Amazon Elastic
> MapReduce (EMR), which raises the following questions for us:
>
> 1. EMR supports only Hadoop 0.19.0 and above. Is it possible to use the
> current data, generated with Hadoop 0.15.5, from Hadoop 0.19.0?
>
> 2. Or, how can we upgrade from Hadoop 0.15.5 to 0.19.0? What issues
> should we expect in doing so?
>
> Regards,
> Ilayaraja
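A minimal sketch of the DistCp-like uploader described above, assuming JetS3t (the library Hadoop's own S3 file systems are built on) for the PUTs. The bucket, key prefix, and credentials are placeholders, and a real version would run as one map task per file rather than single-threaded:

    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.jets3t.service.S3Service;
    import org.jets3t.service.impl.rest.httpclient.RestS3Service;
    import org.jets3t.service.model.S3Object;
    import org.jets3t.service.security.AWSCredentials;

    public class HdfsToS3 {
        // Stay under S3's 5 GB single-object limit.
        private static final long MAX_PART = 5L * 1024 * 1024 * 1024 - 1;

        public static void main(String[] args) throws Exception {
            Path src = new Path(args[0]);  // e.g. /data/part-00000 on HDFS
            String bucket = args[1];       // destination bucket (placeholder)
            String keyPrefix = args[2];    // destination key prefix (placeholder)

            FileSystem fs = FileSystem.get(new Configuration());
            // Placeholder credentials; substitute your own keys.
            S3Service s3 = new RestS3Service(
                new AWSCredentials("ACCESS_KEY", "SECRET_KEY"));

            FSDataInputStream in = fs.open(src);
            byte[] buf = new byte[64 * 1024];
            int part = 0;
            boolean eof = false;
            while (!eof) {
                // Stage up to MAX_PART bytes on local disk so each PUT
                // has a known content length.
                File tmp = File.createTempFile("s3part", ".tmp");
                OutputStream out =
                    new BufferedOutputStream(new FileOutputStream(tmp));
                long written = 0;
                while (written < MAX_PART) {
                    int n = in.read(buf, 0,
                        (int) Math.min(buf.length, MAX_PART - written));
                    if (n == -1) { eof = true; break; }
                    out.write(buf, 0, n);
                    written += n;
                }
                out.close();
                if (written > 0) {
                    // Files under 5 GB come out as a single .part-0 object.
                    S3Object obj = new S3Object(keyPrefix + ".part-" + part++);
                    obj.setDataInputStream(new FileInputStream(tmp));
                    obj.setContentLength(tmp.length());
                    s3.putObject(bucket, obj);
                }
                tmp.delete();
            }
            in.close();
        }
    }

Staging each piece to a temp file costs an extra local write, but S3 needs the Content-Length up front, and retrying a failed PUT is much easier from a file than from a half-consumed HDFS stream.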
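For reference, once the objects are in S3 and you're on Hadoop 0.18.3, S3N is selected just by using s3n:// URIs (for example, s3n://your-bucket/data/ as a job input path, where your-bucket is a placeholder). The credentials go in hadoop-site.xml under the standard S3N property names:

    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_KEY</value>
    </property>

The s3:// scheme, by contrast, is the block file system, which stores data in a format only Hadoop itself can read back.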