On Oct 1, 2008, at 9:55 PM, "Alexander Aristov" <[EMAIL PROTECTED]> wrote:

Nothing special needs to be done for S3; Hadoop will take care of it by itself. You only need to make sure that the bucket is accessible. Don't worry
about the tmp folder.
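
A quick way to check that the bucket is accessible is a round trip with the generic fs shell commands; the bucket name below is just a placeholder, and this assumes the AWS keys are already set in hadoop-site.xml:

bin/hadoop fs -ls s3://my-nutch-bucket/
bin/hadoop fs -mkdir s3://my-nutch-bucket/smoke-test
bin/hadoop fs -rmr s3://my-nutch-bucket/smoke-test

If those work, the directories Nutch needs should get created inside the bucket as the jobs run.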

I configured and ran Hadoop with S3 recently and it was OK, though I have considered switching back to HDFS and using S3 only as a final file store (backup),
because I saw some performance degradation.

Alexander

2008/10/1 Kevin MacDonald <[EMAIL PROTECTED]>

The problem is not that I want to mix file systems. I would like to start from scratch with Hadoop configured to use S3. But to do that, I think the
various Paths that Nutch expects to exist have to be represented somehow in the S3 bucket that Hadoop is using, so that when Nutch tries to write
something to new Path("/tmp/...") Hadoop is able to do it. I think my problem boils down to this: how do you prepare an S3 bucket so that,
from Hadoop's perspective, it looks like a hierarchical file system?
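
My mental model, which may well be off, is roughly the sketch below; this is not actual Nutch code, and the class name and path are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsSketch {
  public static void main(String[] args) throws Exception {
    // Loads hadoop-site.xml; with fs.default.name pointing at s3://<bucket>,
    // FileSystem.get() hands back the S3 filesystem rather than HDFS.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // An unqualified path therefore resolves inside the bucket, which is
    // presumably why Hadoop goes looking for /tmp (and the urls dir) on S3.
    Path tmp = new Path("/tmp/some-nutch-dir");  // illustrative path only
    System.out.println(fs.makeQualified(tmp));   // prints s3://<bucket>/tmp/some-nutch-dir
    fs.mkdirs(tmp);
  }
}
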
Kevin

On Wed, Oct 1, 2008 at 12:28 AM, Doğacan Güney <[EMAIL PROTECTED]> wrote:

On Tue, Sep 30, 2008 at 11:52 PM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:

Does anyone have experience configuring Hadoop to use S3 for use with Nutch? I tried modifying my hadoop-site.xml configuration file, and it looks like
Hadoop is trying to use S3. But I think what's happening is that, once configured to use S3, Hadoop is ONLY looking at S3 for all files. It's
trying to find a /tmp folder there, for example, and when running a crawl Hadoop looks to S3 to find the seed urls folder. Are there steps that
need to happen to prepare an S3 bucket for use by Hadoop so that a Nutch
crawl can happen?
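
For reference, the hadoop-site.xml change I'm talking about is roughly the following; the bucket name is a placeholder and the keys are elided:

<property>
  <name>fs.default.name</name>
  <value>s3://my-nutch-bucket</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>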

If you want to pass paths from other filesystems, I think you can do
something like:

bin/nutch inject crawl/crawldb hdfs://machine:10000/.....
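
The other direction should work too: with S3 as the default filesystem you can leave the crawl paths unqualified and fully qualify only the seed directory. On a single-node setup, for example, something like this (the local path is made up):

bin/nutch inject crawl/crawldb file:///home/kevin/urls

Any filesystem Hadoop knows about (hdfs://, s3://, file://) can be mixed this way as long as the path carries its scheme.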





--
Doğacan Güney





--
Best Regards
Alexander Aristov
