Re: Using S3 Block FileSystem as HDFS replacement

2008-07-01 Thread Chris K Wensel
By editing hadoop-site.xml, you set the default. But I don't recommend changing the default on EC2.


But you can specify the filesystem to use through the URL that references your data (jobConf.addInputPath etc.) for a particular job. In the case of the S3 block filesystem, just use an s3:// URL.
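
e.g. something like this (just a sketch; the class name, bucket, and keys are placeholders, and the credentials could just as well live in hadoop-site.xml):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class S3InputJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(S3InputJob.class);
    conf.setJobName("s3-input-example");

    // credentials for the s3:// scheme (placeholders); these could also
    // be set in hadoop-site.xml instead of in code
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");

    // an s3:// URL selects the S3 block filesystem for this job only,
    // leaving fs.default.name (HDFS) untouched
    conf.addInputPath(new Path("s3://my-bucket/input"));
    conf.setOutputPath(new Path("s3://my-bucket/output"));

    // with no mapper/reducer set, the identity map/reduce runs, which is
    // enough to exercise reading from and writing to S3
    JobClient.runJob(conf);
  }
}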


ckw

On Jun 30, 2008, at 8:04 PM, slitz wrote:


Hello,
I've been trying to set up Hadoop to use S3 as the filesystem. I read in the wiki that it's possible to choose either the S3 native filesystem or the S3 block filesystem. I would like to use the S3 block filesystem to avoid having to manually transfer data from S3 to HDFS every time I want to run a job.

I'm still experimenting with the EC2 contrib scripts, and they seem excellent.
What I can't understand is how it may be possible to use S3 with a public Hadoop AMI, since from my understanding hadoop-site.xml gets written on each instance startup with the options from hadoop-init, and it seems that the public AMI (at least the 0.17.0 one) is not configured to use S3 at all (which makes sense, because the bucket would need individual configuration anyway).

So... to use the S3 block filesystem with EC2 I need to create a custom AMI with a modified hadoop-init script, right? Or am I completely confused?


slitz


--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/








Re: Using S3 Block FileSystem as HDFS replacement

2008-07-01 Thread slitz
That's a good point; in fact, it hadn't occurred to me that I could access it like that.
But some questions came to my mind:

How do I put something into the fs?
Something like bin/hadoop fs -put input input won't work, since S3 is not the default fs, so I tried bin/hadoop fs -put input s3://ID:[EMAIL PROTECTED]/input (and some variations of it), but it didn't work; I always got an error complaining that I hadn't provided the ID/secret for S3.

To experiment a little, I tried to edit conf/hadoop-site.xml (something that's only feasible while experimenting, given the lack of persistence of these changes unless a new AMI is created), added the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties, and changed fs.default.name to an s3:// one (roughly as sketched below).
This worked for things like:

mkdir input
cp conf/*.xml input
bin/hadoop fs -put input input
bin/hadoop fs -ls input
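
For reference, the edit amounts to something like this in conf/hadoop-site.xml (a sketch only; the bucket name and keys below are placeholders):

<configuration>
  <!-- make the S3 block filesystem the default filesystem -->
  <property>
    <name>fs.default.name</name>
    <value>s3://my-bucket</value>
  </property>
  <!-- credentials for the s3:// scheme -->
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>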

But then I faced another problem: when I tried to run
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
which should now be able to run using S3 as the filesystem, I got this error:

08/07/01 22:12:55 INFO mapred.FileInputFormat: Total input paths to process : 2
08/07/01 22:12:57 INFO mapred.JobClient: Running job: job_200807012133_0010
08/07/01 22:12:58 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
(...)

I tried several times, and also with the wordcount example, but the error was always the same.

What could the problem be here? And how can I access the filesystem with bin/hadoop fs ... if the default filesystem isn't S3?

thank you very much :)

slitz

On Tue, Jul 1, 2008 at 4:43 PM, Chris K Wensel [EMAIL PROTECTED] wrote:

 By editing hadoop-site.xml, you set the default. But I don't recommend changing the default on EC2.

 But you can specify the filesystem to use through the URL that references your data (jobConf.addInputPath etc.) for a particular job. In the case of the S3 block filesystem, just use an s3:// URL.

 ckw


 On Jun 30, 2008, at 8:04 PM, slitz wrote:

 Hello,
 I've been trying to set up Hadoop to use S3 as the filesystem. I read in the wiki that it's possible to choose either the S3 native filesystem or the S3 block filesystem. I would like to use the S3 block filesystem to avoid having to manually transfer data from S3 to HDFS every time I want to run a job.

 I'm still experimenting with the EC2 contrib scripts, and they seem excellent.
 What I can't understand is how it may be possible to use S3 with a public Hadoop AMI, since from my understanding hadoop-site.xml gets written on each instance startup with the options from hadoop-init, and it seems that the public AMI (at least the 0.17.0 one) is not configured to use S3 at all (which makes sense, because the bucket would need individual configuration anyway).

 So... to use the S3 block filesystem with EC2 I need to create a custom AMI with a modified hadoop-init script, right? Or am I completely confused?


 slitz


 --
 Chris K Wensel
 [EMAIL PROTECTED]
 http://chris.wensel.net/
 http://www.cascading.org/









Re: Using S3 Block FileSystem as HDFS replacement

2008-07-01 Thread Chris K Wensel

How do I put something into the fs?
Something like bin/hadoop fs -put input input won't work, since S3 is not the default fs, so I tried bin/hadoop fs -put input s3://ID:[EMAIL PROTECTED]/input (and some variations of it), but it didn't work; I always got an error complaining that I hadn't provided the ID/secret for S3.



 hadoop distcp ...

should work with your s3 URLs.
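
e.g. something along these lines (a sketch; ID, SECRET, my-bucket, and the paths are placeholders):

# copy input from the S3 block filesystem into the default filesystem (HDFS)
bin/hadoop distcp s3://ID:SECRET@my-bucket/input /user/root/input

# copy results back out to S3 after the job finishes
bin/hadoop distcp /user/root/output s3://ID:SECRET@my-bucket/output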



08/07/01 22:12:55 INFO mapred.FileInputFormat: Total input paths to process : 2
08/07/01 22:12:57 INFO mapred.JobClient: Running job: job_200807012133_0010

08/07/01 22:12:58 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
(...)

I tried several times, and also with the wordcount example, but the error was always the same.



Unsure. Many things could be conspiring against you. Leave the defaults in hadoop-site and use distcp to copy things around. That's the simplest, I think.


ckw


Using S3 Block FileSystem as HDFS replacement

2008-06-30 Thread slitz
Hello,
I've been trying to set up Hadoop to use S3 as the filesystem. I read in the wiki that it's possible to choose either the S3 native filesystem or the S3 block filesystem. I would like to use the S3 block filesystem to avoid having to manually transfer data from S3 to HDFS every time I want to run a job.

I'm still experimenting with the EC2 contrib scripts, and they seem excellent.
What I can't understand is how it may be possible to use S3 with a public Hadoop AMI, since from my understanding hadoop-site.xml gets written on each instance startup with the options from hadoop-init, and it seems that the public AMI (at least the 0.17.0 one) is not configured to use S3 at all (which makes sense, because the bucket would need individual configuration anyway).

So... to use the S3 block filesystem with EC2 I need to create a custom AMI with a modified hadoop-init script, right? Or am I completely confused?


slitz