Hi Toby,

Keep in mind that the S3 FileSystem is a block-based file system
implementation built on top of the S3 service. Thus, if you store a file
named foo in an S3 bucket using the hadoop tools, the underlying code
will store metadata (block information + hadoop-specific headers) under
the key /user/root/foo. The actual data will be stored under one or more
keys of the form /user/root/BLOCK_[BLOCKID], with each block holding at
most BLOCK SIZE (32 MB by default) of the file's content.
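
To put a number on it (just illustrative arithmetic, not something from
the thread): with the default 32 MB block size, a 100 MB file would end
up as four block keys plus the single metadata key, e.g.:

    // Illustrative sketch only: how many block objects the S3 block store
    // would create for a file of a given size with the 32 MB default.
    public class BlockCount {
      public static void main(String[] args) {
        long blockSize = 32L * 1024 * 1024;   // 32 MB default block size
        long fileSize  = 100L * 1024 * 1024;  // hypothetical 100 MB file
        long numBlocks = (fileSize + blockSize - 1) / blockSize; // = 4
        System.out.println(numBlocks + " block keys + 1 metadata key");
      }
    }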

So this means that you cannot just copy files into an S3 bucket (from
another tool) and have them be visible via the S3 FileSystem support in
hadoop. You would have to use something like

  bin/hadoop fs -fs s3://bucket -copyFromLocal localfile /user/root/s3file

to copy files into S3 in the hadoop-specific format.
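
If you want to do the same thing programmatically, a minimal sketch along
these lines should work (assuming the 0.14-era FileSystem API; BUCKET, the
credentials and the paths below are placeholders):

    // Sketch: rough programmatic equivalent of the -copyFromLocal command.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyToS3 {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same settings you would put in hadoop-site.xml
        conf.set("fs.default.name", "s3://BUCKET");
        conf.set("fs.s3.awsAccessKeyId", "ACCESS_KEY");
        conf.set("fs.s3.awsSecretAccessKey", "SECRET_KEY");

        FileSystem fs = FileSystem.get(conf); // resolves to the S3 FileSystem
        fs.copyFromLocalFile(new Path("localfile"),
                             new Path("/user/root/s3file"));
        fs.close();
      }
    }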

Hope that helps.

Ahad.

On 9/5/07, Toby DiPasquale <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> I'm trying to get Hadoop 0.14.0 to talk to S3, but it's not working out
> for me. My credentials are definitely right (checked multiple times
> and that's not the issue in any case) and the bucket I'm querying
> definitely does exist already and has keys under it.
>
> However, when I try to access S3, I get really weird stuff going on.
> If I try this:
>
>   bin/hadoop fs -ls key-with-stuff-under-it
>
> it fails. I set it to DEBUG and I can see that the S3 library is
> trying to query:
>
>   GET /BUCKET/%2Fuser%2Froot%2Fkey-with-stuff-under-it
>
> which is wrong for obvious reasons, but which I can sort of
> understand, as the default value of the workingDir instance variable
> in S3FileSystem.java is "/user/${user.name}" (anybody know where this
> default comes from, btw? it makes no sense to me as is).
>
> When I try the full URI:
>
>   bin/hadoop fs -ls s3://BUCKET/key-with-stuff-under-it
>
> it still fails, this time doing something even stranger and trying to query
>
>   GET /BUCKET/%2Fkey-with-stuff-under-it
>
> Note the embedded %2F in there, which makes it incorrect.
>
> It also does this same thing whether I embed the credentials in the
> full s3:// URI or not.
>
> Then I thought maybe it doesn't want a bucket at all in
> fs.default.name, so I took that out, but then the configuration failed
> to parse on startup, throwing an IllegalArgumentException telling me
> it was expecting an authority portion in the URI it was given.
>
> Are the docs on the wiki and Amazon's articles wrong about how to
> configure hadoop-site.xml for S3 as a filesystem? Or am I doing
> something wrong in my config, leaving something out, etc? Any
> thoughts? Should I be setting the workingDir somewhere in the config
> to something more sensible? I couldn't find much about this on the
> mailing list or the forums at Amazon so I thought I'd just ask. My
> config info is below.
>
> Any help is greatly appreciated. Thanks :)
>
> My mapred-default.xml is the default one from
> hadoop-0.14.0/src/contrib/ec2/image/hadoop-init. I am running this on
> EC2 with just MapReduce (no HDFS NameNodes/DataNodes running). Here's
> my hadoop-site.xml (anonymized, of course):
>
> ------------------------------------
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value>/mnt/hadoop</value>
> </property>
>
> <property>
>   <name>fs.default.name</name>
>   <value>s3://BUCKET</value>
> </property>
>
> <property>
>   <name>fs.s3.awsAccessKeyId</name>
>   <value>XXXXXXXXXXXXXXXXXX</value>
> </property>
>
> <property>
>   <name>fs.s3.awsSecretAccessKey</name>
>   <value>XXXXXXXXXXXXXXXXXX</value>
> </property>
>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>$MASTER_HOST:50002</value>
> </property>
>
> </configuration>
> ------------------------------------
>
> --
> Toby DiPasquale
>
