On Tue, Nov 24, 2009 at 9:27 PM, Mark Kerzner <[email protected]> wrote:
> Yes, Tom, I saw all these problems. I think that I should stop trying to
> imitate EMR - that's where the idea of storing data on S3 came from - and
> instead transfer the data directly to the Hadoop cluster. Then I will be
> using everything as intended.
>
> Is there a way to scp directly to HDFS, or do I need to scp to local
> storage on some machine and then copy to HDFS?

distcp is the appropriate tool for this. There is some guidance on
http://wiki.apache.org/hadoop/AmazonS3.
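With the s3n access and secret keys set in the configuration as in my
earlier mail below, something along these lines should work (the bucket
name and paths are placeholders; the destination is a path on HDFS when
the command is run from the cluster):

  bin/hadoop distcp s3n://<BUCKET>/path/to/logs logs

If you'd rather not put the keys in the configuration files, passing them
on the distcp command line as generic options, e.g.
-Dfs.s3n.awsAccessKeyId=ID -Dfs.s3n.awsSecretAccessKey=SECRET, should also
work.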
> Also, is there a way to make
> the master a bigger instance than that of the slaves?

No, this is not supported, but I can see it would be useful, particularly
for larger clusters. Please consider opening a JIRA for it.

Cheers,
Tom

>
> Thank you,
> Mark
>
> On Tue, Nov 24, 2009 at 11:20 PM, Tom White <[email protected]> wrote:
>
>> Mark,
>>
>> If the data was transferred to S3 outside of Hadoop then you should
>> use the s3n filesystem scheme (see the explanation on
>> http://wiki.apache.org/hadoop/AmazonS3 for the differences between the
>> Hadoop S3 filesystems).
>>
>> Also, some people have had problems embedding the secret key in the
>> URI, so you can set it in the configuration as follows:
>>
>> <property>
>>   <name>fs.s3n.awsAccessKeyId</name>
>>   <value>ID</value>
>> </property>
>>
>> <property>
>>   <name>fs.s3n.awsSecretAccessKey</name>
>>   <value>SECRET</value>
>> </property>
>>
>> Then use a URI of the form s3n://<BUCKET>/path/to/logs
>>
>> Cheers,
>> Tom
>>
>> On Tue, Nov 24, 2009 at 5:47 PM, Mark Kerzner <[email protected]>
>> wrote:
>> > Hi,
>> >
>> > I need to copy data from S3 to HDFS. This instruction
>> >
>> > bin/hadoop distcp s3://<ID>:<SECRET>@<BUCKET>/path/to/logs logs
>> >
>> > does not seem to work.
>> >
>> > Thank you.
>> >
>> >
