Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by TomWhite:
http://wiki.apache.org/lucene-hadoop/AmazonS3

New page:
[http://aws.amazon.com/s3 Amazon S3] (Simple Storage Service) is a data storage service. You are billed monthly for storage and data transfer. Transfer between S3 and Self:AmazonEC2 is free. This makes S3 attractive for Hadoop users who run clusters on EC2.

There are two ways that S3 can be used with Hadoop's Map/Reduce: either as a replacement for HDFS (i.e. using it as a reliable distributed filesystem with support for very large files), or as a convenient repository for data input to and output from Map/Reduce. In the second case HDFS is still used for the Map/Reduce phase.

S3 support was introduced in Hadoop 0.10 ([http://issues.apache.org/jira/browse/HADOOP-574 HADOOP-574]), but it needs the patch in [http://issues.apache.org/jira/browse/HADOOP-857 HADOOP-857] to work properly. The patch in [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] makes S3 work with the Hadoop CopyFiles (`distcp`) tool. (Hopefully these patches will be integrated in the next release.)

= Setting up Hadoop to use S3 as a replacement for HDFS =

Put the following in ''conf/hadoop-site.xml'' to set the default filesystem to S3:

{{{
<property>
  <name>fs.default.name</name>
  <value>s3://BUCKET</value>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
}}}

Alternatively, you can put the access key ID and the secret access key into the S3 URI as the user info:

{{{
<property>
  <name>fs.default.name</name>
  <value>s3://ID:SECRET@BUCKET</value>
</property>
}}}

Note that since the secret access key can contain slashes, you must remember to escape them by replacing each slash `/` with the string `%2F`. Keys specified in the URI take precedence over any specified using the properties `fs.s3.awsAccessKeyId` and `fs.s3.awsSecretAccessKey`.

Running the Map/Reduce demo in the [http://lucene.apache.org/hadoop/api/index.html Hadoop API Documentation] using S3 is now a matter of running:

{{{
mkdir input
cp conf/*.xml input
bin/hadoop fs -put input input
bin/hadoop fs -ls input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output
cat output/*
}}}

To run in distributed mode you only need to run a JobTracker; the HDFS NameNode is unnecessary.

= Setting up Hadoop to use S3 as a repository for data input to and output from Map/Reduce =

The idea here is to put your input on S3, then transfer it to HDFS using the `bin/hadoop distcp` tool. Once the Map/Reduce job is complete, the output is copied back to S3, either as input to a further job or to be retrieved as a final result (a sketch of the copy commands is given below).

[More instructions will be added after [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is complete.]
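As a rough, untested sketch of that workflow (subject to change once [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is integrated): the bucket name `BUCKET`, the credentials `ID` and `SECRET`, the NameNode address `namenode:9000`, and the paths below are placeholders, not values prescribed by this page.

{{{
# Copy input data from S3 into HDFS before running the job
bin/hadoop distcp s3://ID:SECRET@BUCKET/input hdfs://namenode:9000/user/hadoop/input

# Run the Map/Reduce job against the HDFS copy
bin/hadoop jar hadoop-*-examples.jar grep /user/hadoop/input /user/hadoop/output 'dfs[a-z.]+'

# Copy the results back to S3, for retrieval or as input to a further job
bin/hadoop distcp hdfs://namenode:9000/user/hadoop/output s3://ID:SECRET@BUCKET/output
}}}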