On Oct 1, 2008, at 9:55 PM, "Alexander Aristov" <[EMAIL PROTECTED]> wrote:
Nothing special needs to be done with S3, as Hadoop takes care of it itself. You only have to make sure that the bucket is accessible. Don't worry about the tmp folder.
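For example, one quick sanity check is to list the bucket with the Hadoop fs shell (just a sketch: "your-bucket" is a placeholder, and the fs.s3.* access keys are assumed to already be set in hadoop-site.xml):

# placeholder bucket name; credentials assumed to be in hadoop-site.xml
bin/hadoop fs -ls s3://your-bucket/

If that returns a listing (even an empty one) without an error, Hadoop can reach the bucket.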
I configured and ran Hadoop with S3 recently and it was OK, though I considered switching back to HDFS and using S3 only as a final file store (backup), because I saw some performance degradation.
Alexander
2008/10/1 Kevin MacDonald <[EMAIL PROTECTED]>
The problem is not that I want to mix file systems. I would like to start from scratch with Hadoop configured to use S3. But to do that, I think the various Paths which Nutch expects to be there have to be represented somehow in the S3 bucket that Hadoop is using, so that when Nutch tries to write something to new Path("/tmp/..."), Hadoop is able to do it. I think my problem can be boiled down to this: how do you prepare an S3 bucket so that, from Hadoop's perspective, it looks like a hierarchical file system?
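For concreteness, what I am imagining is something like the following (a sketch only, assuming the fs shell works against S3 the same way it does against HDFS once fs.default.name points at the bucket; the /urls directory and seed.txt file are just the usual Nutch seed layout, not required names):

# directory and file names below are only the conventional Nutch layout
bin/hadoop fs -mkdir /urls
bin/hadoop fs -put seed.txt /urls/seed.txt
bin/hadoop fs -ls /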
Kevin
On Wed, Oct 1, 2008 at 12:28 AM, Doğacan Güney <[EMAIL PROTECTED]> wrote:
On Tue, Sep 30, 2008 at 11:52 PM, Kevin MacDonald <[EMAIL PROTECTED]> wrote:
Does anyone have experience configuring Hadoop to use S3 with Nutch? I tried modifying my hadoop-site.xml configuration file and it looks like Hadoop is trying to use S3. But I think what's happening is that, once configured to use S3, Hadoop is ONLY looking at S3 for all files. It's trying to find a /tmp folder there, for example. And when running a crawl, Hadoop is looking to S3 to find the seed urls folder. Are there steps that need to happen to prepare an S3 bucket for use by Hadoop so that a Nutch crawl can happen?
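(For reference, the kind of hadoop-site.xml change I mean is roughly the following; the bucket name and keys are placeholders, and I'm assuming the s3:// block-store scheme with its fs.s3.* credential properties:)

<!-- placeholder bucket and credentials; s3:// block-store scheme assumed -->
<property>
  <name>fs.default.name</name>
  <value>s3://your-bucket</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>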
If you want to pass paths from other filesystems, I think you can do something like:

bin/nutch inject crawl/crawldb hdfs://machine:10000/.....
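Presumably the same would work with a fully qualified S3 path once the fs.s3.* keys are configured (the bucket name here is a placeholder), e.g.:

# placeholder bucket; assumes the fs.s3.* keys are already configured
bin/nutch inject crawl/crawldb s3://your-bucket/urls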
Kevin
--
Doğacan Güney
--
Best Regards
Alexander Aristov