Dear Wiki user, You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by TomWhite:
http://wiki.apache.org/lucene-hadoop/AmazonS3

New page:
[http://aws.amazon.com/s3 Amazon S3] (Simple Storage Service) is a data storage service. You are billed monthly for storage and data transfer. Transfer between S3 and Self:AmazonEC2 is free. This makes S3 attractive for Hadoop users who run clusters on EC2.

There are two ways that S3 can be used with Hadoop's Map/Reduce: either as a replacement for HDFS (i.e. using it as a reliable distributed filesystem with support for very large files), or as a convenient repository for data input to and output from Map/Reduce. In the second case HDFS is still used for the Map/Reduce phase.

S3 support was introduced in Hadoop 0.10 ([http://issues.apache.org/jira/browse/HADOOP-574 HADOOP-574]), but it needs the patch in [http://issues.apache.org/jira/browse/HADOOP-857 HADOOP-857] to work properly. The patch in [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] makes S3 work with the Hadoop CopyFiles (`distcp`) tool. (Hopefully these patches will be integrated in the next release.)

= Setting up Hadoop to use S3 as a replacement for HDFS =

Put the following in ''conf/hadoop-site.xml'' to set the default filesystem to S3:

{{{
<property>
  <name>fs.default.name</name>
  <value>s3://BUCKET</value>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
}}}

Alternatively, you can put the access key ID and the secret access key into the S3 URI as the user info:

{{{
<property>
  <name>fs.default.name</name>
  <value>s3://ID:SECRET@BUCKET</value>
</property>
}}}

Note that since the secret access key can contain slashes, you must remember to escape them by replacing each slash `/` with the string `%2F`. Keys specified in the URI take precedence over any specified using the properties `fs.s3.awsAccessKeyId` and `fs.s3.awsSecretAccessKey`.

Running the Map/Reduce demo in the [http://lucene.apache.org/hadoop/api/index.html Hadoop API Documentation] using S3 is now a matter of running:

{{{
mkdir input
cp conf/*.xml input
bin/hadoop fs -put input input
bin/hadoop fs -ls input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -get output output
cat output/*
}}}

To run in distributed mode you only need to run a JobTracker; the HDFS NameNode is unnecessary.

= Setting up Hadoop to use S3 as a repository for data input to and output from Map/Reduce =

The idea here is to put your input on S3, then transfer it to HDFS using the `bin/hadoop distcp` tool. Once the Map/Reduce job is complete, the output is copied back to S3, either as input to a further job or to be retrieved as a final result (a sketch of the copy commands is given below).

[More instructions will be added after [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is complete.]
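As a rough, untested sketch of that workflow (subject to change once [https://issues.apache.org/jira/browse/HADOOP-862 HADOOP-862] is integrated): the bucket name `BUCKET`, the credentials `ID` and `SECRET`, the NameNode address `namenode:9000`, and the paths below are placeholders, not values prescribed by this page.

{{{
# Copy input data from S3 into HDFS before running the job
bin/hadoop distcp s3://ID:SECRET@BUCKET/input hdfs://namenode:9000/user/hadoop/input

# Run the Map/Reduce job against the HDFS copy
bin/hadoop jar hadoop-*-examples.jar grep /user/hadoop/input /user/hadoop/output 'dfs[a-z.]+'

# Copy the results back to S3, for retrieval or as input to a further job
bin/hadoop distcp hdfs://namenode:9000/user/hadoop/output s3://ID:SECRET@BUCKET/output
}}}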