Take a look at Heritrix and NutchWAX; I recommend them for your task.

Best regards,

---
Andreas P. Koenzen

On 19/02/2010, at 07:58 a.m., Paul Dhaliwal wrote:

Hi Amit,

I understand your question a little better now.

I think you probably want to take a look at NDFS:
http://wiki.apache.org/nutch/NutchDistributedFileSystem

The WebDB doc would probably help you understand things too:
http://wiki.apache.org/nutch/DistributedWebDB

Hope this helps.

Paul



On Fri, Feb 19, 2010 at 2:03 AM, Amit Agarwal <aagarwa...@gmail.com> wrote:

Hi Paul,

Yes, on first read it looks like a good option. I have downloaded Nutch
from the official download site.

I believe that Fetcher is the class which is responsible for downloading
any web content. I successfully ran the test cases in the TestFetcher
class, and they execute with no issues. But I am unable to locate the
downloaded files. I am not sure what configuration change would cause it
to store the files in the local filesystem.

I guess the last mail gave the incorrect impression that I want to
replicate Nutch. Instead, I want to use it in a web application (and
probably keep it reusable, so that plugging it into a different web app
is easy). The web app is supposed to allow its users to download any web
page they find interesting, and later view those pages by doing a keyword
search over them. Since I couldn't locate any good tutorial or guide on
how to use the Nutch APIs, I thought of sending a mail to the user group
to see if anyone can answer it.
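
For illustration, a minimal sketch of the store / index / search flow
described above, using only the JDK. This is not the Nutch or Lucene API:
the class name PageCache and its methods store, search, and load are
hypothetical, the fetch step is left out (page text is passed in as a
string), and the in-memory keyword map merely stands in for a real Lucene
index.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not part of Nutch or Lucene. */
final class PageCache {
    private final Path dir;
    // keyword -> names of the stored pages that contain it
    private final Map<String, Set<String>> index = new HashMap<>();

    PageCache(Path dir) throws IOException {
        // Create the cache directory if it does not exist yet.
        this.dir = Files.createDirectories(dir);
    }

    /** Writes the page text to the cache directory and indexes its words. */
    void store(String name, String text) throws IOException {
        Files.write(dir.resolve(name), text.getBytes(StandardCharsets.UTF_8));
        // Split on anything that is not a letter or a digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(name);
            }
        }
    }

    /** Returns the names of stored pages containing the keyword (case-insensitive). */
    Set<String> search(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(Locale.ROOT),
                                  Collections.<String>emptySet());
    }

    /** Reads a stored page back from the local filesystem. */
    String load(String name) throws IOException {
        return new String(Files.readAllBytes(dir.resolve(name)), StandardCharsets.UTF_8);
    }
}
```

A web app could call store() once a page has been fetched, search() for a
user's keyword, and load() to display each hit. The index here lives only
in memory, so it is lost on restart; a persistent index is the part Lucene
would normally provide.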

- Amit

On Fri, Feb 19, 2010 at 2:36 PM, Paul Dhaliwal <subp...@gmail.com> wrote:

You might want to look at Nutch. I think they just read this email and
implemented it ;)

Paul


On Thu, Feb 18, 2010 at 7:17 PM, Amit Agarwal <aagarwa...@gmail.com> wrote:

Hi,

I am a newbie to Nutch and Lucene. I have a task to build a framework for
web page caching on the local system (i.e. download and store web pages in
the local filesystem), indexing (index pages on keywords), and search
(search the local web page cache using the keywords). The preference would
be to build the framework using Java APIs available in third-party jars.

At first glance, it seems Nutch+Hadoop+Lucene should be a good option to
build this framework. Do you think it is the right option? Any ideas or
links would be appreciated.

Regards,
Amit



