Nothing special needs to be done for S3, as Hadoop takes care of it
itself. You only need to make sure the bucket is accessible. Don't worry
about the tmp folder.
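If it helps, the relevant hadoop-site.xml entries would look roughly like
this (the bucket name and AWS keys are placeholders):

  <!-- make the S3 block filesystem the default, so relative paths like
       crawl/crawldb and /tmp/... resolve inside the bucket -->
  <property>
    <name>fs.default.name</name>
    <value>s3://your-bucket-name</value>
  </property>
  <!-- AWS credentials for the bucket (placeholders) -->
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>

Note that the s3:// scheme stores data in Hadoop's block format rather
than as plain files, so it is best to give it a dedicated, empty bucket.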
I configured and ran Hadoop with S3 recently, and it worked OK, though I
considered switching back to HDFS and using S3 only as a final file store
(backup) because I saw some performance degradation.
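If you do use S3 only as a final store, something like distcp can copy the
crawl data over when a job finishes; a rough sketch (the namenode address
and bucket name are placeholders):

  bin/hadoop distcp hdfs://namenode:9000/user/nutch/crawl s3://your-bucket-name/crawl-backup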
Alexander
2008/10/1 Kevin MacDonald <[EMAIL PROTECTED]>
> The problem is not that I want to mix file systems. I would like to start
> from scratch with Hadoop configured to use S3. But to do that I think that
> various Paths which Nutch expects to be there have to be represented
> somehow in the S3 bucket that Hadoop is using. So, when Nutch tries to
> write something to new Path("/tmp/..."), Hadoop is able to do it. I think my
> problem can be boiled down to this: How do you prepare an S3 bucket so that
> from Hadoop's perspective it looks like a hierarchical file system?
> Kevin
>
> On Wed, Oct 1, 2008 at 12:28 AM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
>
> > On Tue, Sep 30, 2008 at 11:52 PM, Kevin MacDonald <[EMAIL PROTECTED]>
> > wrote:
> > > Does anyone have experience configuring Hadoop to use S3 for using
> > > nutch? I tried modifying my hadoop-site.xml configuration file and it
> > > looks like Hadoop is trying to use S3. But I think what's happening is
> > > that, once configured to use S3, Hadoop is ONLY looking at S3 for all
> > > files. It's trying to find a /tmp folder there, for example. And when
> > > running a crawl Hadoop is looking to S3 to find the seed urls folder.
> > > Are there steps that need to happen to prepare an S3 bucket for use by
> > > Hadoop so that a nutch crawl can happen?
> >
> > If you want to pass paths from other filesystems, I think you can do
> > something like:
> >
> > bin/nutch inject crawl/crawldb hdfs://machine:10000/.....
> >
> > > Kevin
> > >
> >
> > --
> > Doğacan Güney
> >
>
--
Best Regards
Alexander Aristov