We recommend that people use Amazon S3 as the durable store when using Elastic 
MapReduce. We consider the HDFS on Elastic MapReduce clusters to be transient.

With that said, you need some way to get your data into S3 from HDFS. We 
recommend storing the files directly in S3 (with S3N) and not using the S3 
block file system. That presents two challenges:

1. Making sure all files on your cluster are less than 5 GB (a quick way to
   find oversized files is sketched below).
2. Uploading your files without the use of S3N (which wasn't introduced
   until 0.18).
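
For the first challenge, here is a minimal sketch, assuming only the stock
Hadoop FileSystem API of that era. It walks HDFS recursively and prints every
file above the 5 GB single-object limit, so you know up front what needs
splitting (the input path is just a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OversizeFileFinder {
        // S3 caps a single object at 5 GB, so anything larger must be split.
        private static final long LIMIT = 5L * 1024 * 1024 * 1024;

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            walk(fs, new Path(args[0])); // e.g. /user/hadoop/output
        }

        private static void walk(FileSystem fs, Path dir) throws Exception {
            for (FileStatus stat : fs.listStatus(dir)) {
                if (stat.isDir()) {
                    walk(fs, stat.getPath()); // recurse into subdirectories
                } else if (stat.getLen() > LIMIT) {
                    System.out.println(stat.getPath() + "\t" + stat.getLen());
                }
            }
        }
    }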

You'll probably want to write a DistCp-like job which reads the files from HDFS 
and uploads them to S3. If necessary, it should also detect files that are 
larger than 5 GB and split them into multiple pieces.
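
Here is a rough sketch of the upload side, just to show the shape of it. It
uses the JetS3t client purely as an illustration (any S3 client would do),
and the bucket name, key layout, and error handling are all placeholders. It
streams one HDFS file to S3 in pieces of at most 5 GB:

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.jets3t.service.S3Service;
    import org.jets3t.service.model.S3Object;

    public class HdfsToS3Copier {
        private static final long PART_SIZE = 5L * 1024 * 1024 * 1024; // 5 GB cap

        // Copies src to s3://bucket/keyPrefix<name>[.partN], splitting as needed.
        public static void copy(FileSystem fs, Path src, S3Service s3,
                                String bucket, String keyPrefix) throws Exception {
            long fileLen = fs.getFileStatus(src).getLen();
            int part = 0;
            for (long offset = 0; offset < fileLen; offset += PART_SIZE, part++) {
                long partLen = Math.min(PART_SIZE, fileLen - offset);
                FSDataInputStream in = fs.open(src);
                try {
                    in.seek(offset); // jump to the start of this piece
                    String key = keyPrefix + src.getName()
                            + (fileLen > PART_SIZE ? ".part" + part : "");
                    S3Object obj = new S3Object(key);
                    obj.setDataInputStream(new BoundedInputStream(in, partLen));
                    obj.setContentLength(partLen); // S3 PUTs need the length up front
                    s3.putObject(bucket, obj);
                } finally {
                    in.close();
                }
            }
        }

        /** Caps an underlying stream at a fixed number of bytes. */
        private static class BoundedInputStream extends FilterInputStream {
            private long remaining;

            BoundedInputStream(InputStream in, long limit) {
                super(in);
                this.remaining = limit;
            }

            @Override
            public int read() throws IOException {
                if (remaining <= 0) return -1;
                int b = in.read();
                if (b >= 0) remaining--;
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                if (remaining <= 0) return -1;
                int n = in.read(buf, off, (int) Math.min(len, remaining));
                if (n > 0) remaining -= n;
                return n;
            }
        }
    }

To make it DistCp-like, drive copy() from a map-only job whose input is a
list of HDFS paths, one per record, so the uploads run in parallel across the
cluster. You would construct the S3Service from your AWS credentials (with
JetS3t, that's a RestS3Service).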

Andrew

On 3/22/10 9:23 PM, "ilayaraja" <ilayar...@rediff.co.in> wrote:

Hi Andrew,

Yes, the data is on an EC2 cluster only.

Regards,
Ilay

----- Original Message -----
From: Hitchcock, Andrew <mailto:a...@amazon.com>
To: common-dev@hadoop.apache.org; ilayar...@rediff.co.in
Sent: Tuesday, March 23, 2010 1:57 AM
Subject: Re: Hadoop Compatibility and EMR


Hi,

At this time Elastic MapReduce only supports Hadoop 0.18.3.

The cluster that stores the 10 TB of data, is that currently running on
Amazon EC2?

Regards,
Andrew

On Mar 21, 2010, at 12:23 AM, "ilayaraja" <ilayar...@rediff.co.in> wrote:
> Hi,
>
> We've been using hadoop 15.5 in our production environment, where we have
> about 10 TB of data stored on the dfs.
> The files were generated as mapreduce output. We want to move our
> environment to Amazon Elastic MapReduce (EMR), which raises the following
> questions for us:
>
> 1. EMR supports only hadoop 19.0 and above. Is it possible to use the
> current data that were generated with hadoop 15.5 from hadoop 19.0?
>
> 2. Or how can we make it possible to use or update to hadoop 19.0 from
> hadoop 15.5? What are the issues expected while doing so?
>
>
> Regards,
> Ilayaraja
