There is an open Jira issue for adding support for reading regular S3
files (https://issues.apache.org/jira/browse/HADOOP-930), which would
solve this problem. Until that is implemented, Ahad's suggestion is a
good workaround (you can also use distcp).
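For example, a distcp invocation along these lines should do it (the
HDFS path and bucket name are placeholders, and I haven't run this
exact command):

  bin/hadoop distcp hdfs://namenode:9000/user/root/input s3://BUCKET/user/root/input
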
Tom
On 05/09/07, Ahad Rana <[EMAIL PROTECTED]> wrote:
> Hi Toby,
>
> Keep in mind that the S3 filesystem is a block-based file system
> implementation built on top of the S3 service. Thus, if you store a
> file named foo using the Hadoop tools in an S3 bucket, the underlying code
> will store metadata (block information + Hadoop-specific headers) in the key
> /user/root/foo. The actual data will be stored under one or more keys of
> the form /user/root/BLOCK_[BLOCKID], with each block holding at most a
> block-size portion (32 MB by default) of the file's content.
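>
> For example, a 70 MB file stored as /user/root/foo might be laid out
> roughly like this (the block IDs here are invented for illustration):
>
>   /user/root/foo                  (metadata: block list + headers)
>   /user/root/BLOCK_8041267351     (first 32 MB of data)
>   /user/root/BLOCK_2214409865     (next 32 MB)
>   /user/root/BLOCK_6677001982     (remaining 6 MB)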
>
> So, this means that you cannot just copy files into an S3 bucket (from
> another tool) and have them be visible via the S3 FileSystem support in
> Hadoop. You would have to use something like
>
>   bin/hadoop fs -fs s3://bucket -copyFromLocal localfile /user/root/s3file
>
> to copy files across into S3 in the Hadoop-specific format.
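>
> Once the copy completes you should, if I recall the tool's behavior
> correctly, be able to verify it with a listing against the same
> filesystem:
>
>   bin/hadoop fs -fs s3://bucket -ls /user/root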
>
> Hope that helps.
>
> Ahad.
>
> On 9/5/07, Toby DiPasquale <[EMAIL PROTECTED]> wrote:
> >
> > Hi all,
> >
> > I'm trying to get Hadoop 0.14.0 to talk to S3, but it's not working out
> > for me. My credentials are definitely right (checked multiple times
> > and that's not the issue in any case) and the bucket I'm querying
> > definitely does exist already and has keys under it.
> >
> > However, when I try to access S3, I see some really strange behavior.
> > If I try this:
> >
> > bin/hadoop fs -ls key-with-stuff-under-it
> >
> > it fails. With the log level set to DEBUG, I can see that the S3
> > library is trying to query:
> >
> > GET /BUCKET/%2Fuser%2Froot%2Fkey-with-stuff-under-it
> >
> > which is wrong for obvious reasons, but which I can sort of
> > understand, as the default value of the workingDir instance variable
> > in S3FileSystem.java is "/user/${user.name}" (anybody know where this
> > default comes from, btw? it makes no sense to me as is).
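> >
> > (My guess, which I have not verified against the 0.14.0 source, is that
> > the field is simply initialized from the user.name system property,
> > something like:
> >
> >   // hypothetical sketch of the field initialization, not the actual source
> >   private Path workingDir =
> >       new Path("/user", System.getProperty("user.name"));
> >
> > which would at least explain the /user/root prefix when running as root.)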
> >
> > When I try the full URI:
> >
> > bin/hadoop fs -ls s3://BUCKET/key-with-stuff-under-it
> >
> > it still fails, this time doing something even stranger and querying:
> >
> > GET /BUCKET/%2Fkey-with-stuff-under-it
> >
> > Note the embedded %2F in there, which makes it incorrect.
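> >
> > (For comparison, I would expect the raw S3 request to look something
> > like GET /BUCKET/key-with-stuff-under-it, with a literal slash rather
> > than the encoded %2F.)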
> >
> > It does the same thing whether or not I embed the credentials in the
> > full s3:// URI.
> >
> > Then I thought maybe it doesn't want a bucket at all in
> > fs.default.name, so I took that out, but then the configuration failed
> > to parse on startup, throwing an IllegalArgumentException saying it
> > expected an authority portion in the URI it was given.
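> >
> > (That exception makes a certain amount of sense: the bucket rides in
> > the authority component of the URI, which a quick java.net.URI check
> > illustrates. This snippet is just my own illustration, not Hadoop code:
> >
> >   import java.net.URI;
> >
> >   public class AuthorityCheck {
> >       public static void main(String[] args) throws Exception {
> >           // "s3://BUCKET" carries the bucket name as the URI authority...
> >           System.out.println(new URI("s3://BUCKET").getAuthority());  // BUCKET
> >           // ...while a URI with no bucket has no authority at all
> >           System.out.println(new URI("s3:///path").getAuthority());   // null
> >       }
> >   }
> >
> > So with the bucket removed there is nothing left to use as an authority.)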
> >
> > Are the docs on the wiki and Amazon's articles wrong about how to
> > configure hadoop-site.xml for S3 as a filesystem? Or am I doing
> > something wrong in my config, leaving something out, etc.? Any
> > thoughts? Should I be setting the workingDir somewhere in the config
> > to something more sensible? I couldn't find much about this on the
> > mailing list or the forums at Amazon, so I thought I'd just ask. My
> > config info is below.
> >
> > Any help is greatly appreciated. Thanks :)
> >
> > My mapred-default.xml is the default one from
> > hadoop-0.14.0/src/contrib/ec2/image/hadoop-init. I am running this on
> > EC2 with just MapReduce (no HDFS NameNodes/DataNodes running). Here's
> > my hadoop-site.xml (anonymized, of course):
> >
> > ------------------------------------
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> >
> > <configuration>
> >
> > <property>
> >   <name>hadoop.tmp.dir</name>
> >   <value>/mnt/hadoop</value>
> > </property>
> >
> > <property>
> >   <name>fs.default.name</name>
> >   <value>s3://BUCKET</value>
> > </property>
> >
> > <property>
> >   <name>fs.s3.awsAccessKeyId</name>
> >   <value>XXXXXXXXXXXXXXXXXX</value>
> > </property>
> >
> > <property>
> >   <name>fs.s3.awsSecretAccessKey</name>
> >   <value>XXXXXXXXXXXXXXXXXX</value>
> > </property>
> >
> > <property>
> >   <name>mapred.job.tracker</name>
> >   <value>$MASTER_HOST:50002</value>
> > </property>
> >
> > </configuration>
> > ------------------------------------
> >
> > --
> > Toby DiPasquale
> >
>