On Oct 1, 2008, at 9:55 PM, "Alexander Aristov" <[EMAIL PROTECTED]> wrote:
Nothing special needs to be done with S3, as Hadoop takes care of it itself. You only have to make sure that the bucket is accessible. Don't worry about the tmp folder.
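For example, one quick sanity check is to list the bucket with the Hadoop fs shell (just a sketch: "your-bucket" is a placeholder, and the fs.s3.* access keys are assumed to already be set in hadoop-site.xml):

# placeholder bucket name; credentials assumed to be in hadoop-site.xml
bin/hadoop fs -ls s3://your-bucket/

If that returns a listing (even an empty one) without an error, Hadoop can reach the bucket.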
I configured and ran Hadoop with S3 recently and it was OK, though I considered switching back to HDFS and using S3 only as a final file store (backup), because I saw some performance degradation.
Alexander
2008/10/1 Kevin MacDonald <[EMAIL PROTECTED]>
The problem is not that I want to mix file systems. I would like to start from scratch with Hadoop configured to use S3. But to do that, I think the various Paths which Nutch expects to be there have to be represented somehow in the S3 bucket that Hadoop is using, so that when Nutch tries to write something to new Path("/tmp/..."), Hadoop is able to do it. I think my problem can be boiled down to this: how do you prepare an S3 bucket so that, from Hadoop's perspective, it looks like a hierarchical file system?
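For concreteness, what I am imagining is something like the following (a sketch only, assuming the fs shell works against S3 the same way it does against HDFS once fs.default.name points at the bucket; the /urls directory and seed.txt file are just the usual Nutch seed layout, not required names):

# directory and file names below are only the conventional Nutch layout
bin/hadoop fs -mkdir /urls
bin/hadoop fs -put seed.txt /urls/seed.txt
bin/hadoop fs -ls /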
Kevin
On Wed, Oct 1, 2008 at 12:28 AM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
On Tue, Sep 30, 2008 at 11:52 PM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
Does anyone have experience configuring Hadoop to use S3 with Nutch? I tried modifying my hadoop-site.xml configuration file and it looks like Hadoop is trying to use S3. But I think what's happening is that, once configured to use S3, Hadoop is ONLY looking at S3 for all files. It's trying to find a /tmp folder there, for example. And when running a crawl, Hadoop is looking to S3 to find the seed urls folder. Are there steps that need to happen to prepare an S3 bucket for use by Hadoop so that a Nutch crawl can happen?
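(For reference, the kind of hadoop-site.xml change I mean is roughly the following; the bucket name and keys are placeholders, and I'm assuming the s3:// block-store scheme with its fs.s3.* credential properties:)

<!-- placeholder bucket and credentials; s3:// block-store scheme assumed -->
<property>
  <name>fs.default.name</name>
  <value>s3://your-bucket</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>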
If you want to pass paths from other filesystems, I think you can do something like:

bin/nutch inject crawl/crawldb hdfs://machine:10000/.....
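Presumably the same would work with a fully qualified S3 path once the fs.s3.* keys are configured (the bucket name here is a placeholder), e.g.:

# placeholder bucket; assumes the fs.s3.* keys are already configured
bin/nutch inject crawl/crawldb s3://your-bucket/urls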
Kevin
--
Doğacan Güney
--
Best Regards
Alexander Aristov