When I ran a Nutch job using S3, Hadoop actually created the bucket, so I
have to assume it has all the permissions it requires. And yet I still got
exceptions in the log saying that it could not find various paths. There
must be something else that needs to be done beyond simply specifying a
bucket name and AWS credentials in the hadoop-site.xml file.
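
For reference, this is roughly what I have in hadoop-site.xml (the bucket
name and keys below are placeholders, and pointing fs.default.name at the
bucket is just my reading of how the S3 setup is supposed to work):

  <property>
    <name>fs.default.name</name>
    <value>s3://mybucket</value>  <!-- placeholder bucket name -->
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>

My understanding is that with fs.default.name set to the s3:// bucket, any
path Nutch uses without an explicit scheme (the /tmp working dirs,
crawl/crawldb, the seed urls folder) should resolve against that bucket.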
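
What I have not figured out is whether those directories have to exist
before the job runs. If they do, I assume something along these lines would
create them (again, mybucket is a placeholder, and I have not confirmed
this step is actually required):

  bin/hadoop fs -mkdir s3://mybucket/tmp
  bin/hadoop fs -put urls s3://mybucket/urls
  bin/nutch inject crawl/crawldb s3://mybucket/urls

That is, create the expected layout up front with the hadoop fs shell and
then point the inject at a fully qualified s3:// URI.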
2008/10/1 Alexander Aristov <[EMAIL PROTECTED]>
> Nothing special should be done with S3, as Hadoop will take care of it
> itself. You only need to be sure that the bucket is accessible. Don't worry
> about the tmp folder.
>
> I configured and ran Hadoop with S3 recently and it was OK, though I have
> considered switching back to HDFS and using S3 only as a final file store
> (backup) because I saw some performance degradation.
>
> Alexander
>
> 2008/10/1 Kevin MacDonald <[EMAIL PROTECTED]>
>
> > The problem is not that I want to mix file systems. I would like to start
> > from scratch with Hadoop configured to use S3. But to do that, I think the
> > various Paths which Nutch expects to be there have to be represented
> > somehow in the S3 bucket that Hadoop is using, so that when Nutch tries to
> > write something to new Path("/tmp/..."), Hadoop is able to do it. I think
> > my problem boils down to this: how do you prepare an S3 bucket so that,
> > from Hadoop's perspective, it looks like a hierarchical file system?
> > Kevin
> >
> > On Wed, Oct 1, 2008 at 12:28 AM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> >
> > > On Tue, Sep 30, 2008 at 11:52 PM, Kevin MacDonald <[EMAIL PROTECTED]>
> > > wrote:
> > > > Does anyone have experience configuring Hadoop to use S3 for Nutch? I
> > > > tried modifying my hadoop-site.xml configuration file and it looks
> > > > like Hadoop is trying to use S3. But I think what's happening is that,
> > > > once configured to use S3, Hadoop is ONLY looking at S3 for all files.
> > > > It's trying to find a /tmp folder there, for example. And when running
> > > > a crawl, Hadoop is looking to S3 to find the seed urls folder. Are
> > > > there steps that need to happen to prepare an S3 bucket for use by
> > > > Hadoop so that a Nutch crawl can happen?
> > >
> > > If you want to pass paths from other filesystems, I think you can do
> > > something like:
> > >
> > > bin/nutch inject crawl/crawldb hdfs://machine:10000/.....
> > >
> > > > Kevin
> > > >
> > >
> > >
> > >
> > > --
> > > Doğacan Güney
> > >
> >
>
>
>
> --
> Best Regards
> Alexander Aristov
>