Re: Using S3 Block FileSystem as HDFS replacement
By editing hadoop-site.xml you set the default filesystem, but I don't recommend changing the default on EC2. Instead, you can specify the filesystem to use through the URL that references your data (jobConf.addInputPath etc.) for a particular job. In the case of the S3 block filesystem, just use an s3:// URL. (A quick sketch of what that looks like is at the end of this message, after the quoted text.)

ckw

On Jun 30, 2008, at 8:04 PM, slitz wrote:

> Hello,
>
> I've been trying to set up Hadoop to use S3 as its filesystem. I read in the
> wiki that it's possible to choose either the S3 native FileSystem or the S3
> block FileSystem. I would like to use the S3 block FileSystem to avoid
> manually transferring data from S3 to HDFS every time I want to run a job.
>
> I'm still experimenting with the EC2 contrib scripts, and those seem to be
> excellent. What I can't understand is how it is possible to use S3 with a
> public Hadoop AMI, since from my understanding hadoop-site.xml gets written
> on each instance startup with the options in hadoop-init, and the public AMI
> (at least the 0.17.0 one) is not configured to use S3 at all (which makes
> sense, because the bucket would need individual configuration anyway).
>
> So... to use the S3 block FileSystem with EC2, do I need to create a custom
> AMI with a modified hadoop-init script? Or am I completely confused?
>
> slitz

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
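For what it's worth, a minimal sketch of the per-job approach (the class name, bucket and key values below are placeholders, not taken from this thread; setting the fs.s3.* properties on the JobConf is just one way to supply the credentials, the same properties that would otherwise go in hadoop-site.xml):

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class S3JobExample {
      public static void main(String[] args) throws IOException {
        // HDFS stays the default filesystem; this particular job reads from and
        // writes to the S3 block filesystem because its paths use s3:// URLs.
        JobConf conf = new JobConf(S3JobExample.class);
        conf.setJobName("s3-block-fs-example");

        // Credentials for the s3: scheme (placeholders, not real keys).
        conf.set("fs.s3.awsAccessKeyId", "YOUR_ID");
        conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET");

        // Point this job at the S3 block filesystem explicitly.
        conf.addInputPath(new Path("s3://YOUR_BUCKET/input"));
        conf.setOutputPath(new Path("s3://YOUR_BUCKET/output"));

        JobClient.runJob(conf);
      }
    }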
Re: Using S3 Block FileSystem as HDFS replacement
That's a good point; in fact it hadn't occurred to me that I could access it like that. But some questions came to mind.

How do I put something into the fs? Something like

    bin/hadoop fs -put input input

will not work, since S3 is not the default fs, so I tried

    bin/hadoop fs -put input s3://ID:[EMAIL PROTECTED]/input

(and some variations of it), but it didn't work; I always got an error complaining that the ID/secret for S3 had not been provided.

To experiment a little I tried editing conf/hadoop-site.xml (something that only makes sense while experimenting, given the lack of persistence of these changes unless a new AMI is created), added the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties, and changed fs.default.name to an s3:// URL (a rough sketch of these properties is at the end of this message, after the quoted text). That worked for things like:

    mkdir input
    cp conf/*.xml input
    bin/hadoop fs -put input input
    bin/hadoop fs -ls input

But then I faced another problem: when I tried to run

    bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

which should now be able to run using S3 as the FileSystem, I got this error:

    08/07/01 22:12:55 INFO mapred.FileInputFormat: Total input paths to process : 2
    08/07/01 22:12:57 INFO mapred.JobClient: Running job: job_200807012133_0010
    08/07/01 22:12:58 INFO mapred.JobClient:  map 100% reduce 100%
    java.io.IOException: Job failed!
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
    (...)

I tried several times, and with the wordcount example as well, but the error was always the same. What could the problem be here? And how can I access the FileSystem with bin/hadoop fs ... if the default filesystem isn't S3?

Thank you very much :)

slitz

On Tue, Jul 1, 2008 at 4:43 PM, Chris K Wensel [EMAIL PROTECTED] wrote:

> By editing hadoop-site.xml you set the default, but I don't recommend
> changing the default on EC2. You can specify the filesystem to use through
> the URL that references your data (jobConf.addInputPath etc.) for a
> particular job. In the case of the S3 block filesystem, just use an s3:// URL.
>
> ckw
>
> On Jun 30, 2008, at 8:04 PM, slitz wrote:
> [...]
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
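For reference, a rough sketch of what the hadoop-site.xml additions looked like (ID, SECRET and BUCKET stand in for the real values):

    <!-- added to conf/hadoop-site.xml while experimenting; values are placeholders -->
    <property>
      <name>fs.default.name</name>
      <value>s3://BUCKET</value>
    </property>

    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>ID</value>
    </property>

    <property>
      <name>fs.s3.awsSecretAccessKey</name>
      <value>SECRET</value>
    </property>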
Re: Using S3 Block FileSystem as HDFS replacement
> How do I put something into the fs? Something like
>
>     bin/hadoop fs -put input input
>
> will not work, since S3 is not the default fs, so I tried
>
>     bin/hadoop fs -put input s3://ID:[EMAIL PROTECTED]/input
>
> (and some variations of it), but it didn't work; I always got an error
> complaining about not having provided the ID/secret for S3.

hadoop distcp ... should work with your s3 urls (a rough sketch is at the end of this message).

>     08/07/01 22:12:55 INFO mapred.FileInputFormat: Total input paths to process : 2
>     08/07/01 22:12:57 INFO mapred.JobClient: Running job: job_200807012133_0010
>     08/07/01 22:12:58 INFO mapred.JobClient:  map 100% reduce 100%
>     java.io.IOException: Job failed!
>             at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
>     (...)
>
> I tried several times, and with the wordcount example as well, but the error
> was always the same.

Unsure. Many things could be conspiring against you. Leave the defaults in hadoop-site.xml and use distcp to copy things around; that's the simplest approach, I think.

ckw
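For example (a rough sketch only; ID, SECRET and BUCKET are placeholders, and the HDFS-side paths assume your data lives under /user/root on the default filesystem):

    # copy job input from HDFS (the default fs) into the S3 block filesystem
    bin/hadoop distcp /user/root/input s3://ID:SECRET@BUCKET/input

    # copy results back out of S3 after the job finishes
    bin/hadoop distcp s3://ID:SECRET@BUCKET/output /user/root/output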
Using S3 Block FileSystem as HDFS replacement
Hello,

I've been trying to set up Hadoop to use S3 as its filesystem. I read in the wiki that it's possible to choose either the S3 native FileSystem or the S3 block FileSystem. I would like to use the S3 block FileSystem to avoid manually transferring data from S3 to HDFS every time I want to run a job.

I'm still experimenting with the EC2 contrib scripts, and those seem to be excellent. What I can't understand is how it is possible to use S3 with a public Hadoop AMI, since from my understanding hadoop-site.xml gets written on each instance startup with the options in hadoop-init, and the public AMI (at least the 0.17.0 one) is not configured to use S3 at all (which makes sense, because the bucket would need individual configuration anyway).

So... to use the S3 block FileSystem with EC2, do I need to create a custom AMI with a modified hadoop-init script? Or am I completely confused?

slitz