Take a look at Heritrix and NutchWAX; I recommend them for your task.
Best regards,
---
Andreas P. Koenzen
On 19/02/2010, at 07:58 a.m., Paul Dhaliwal wrote:
Hi Amit,
I understand your question a little better now.
I think you probably want to take a look at NDFS:
http://wiki.apache.org/nutch/NutchDistributedFileSystem
The WebDB doc would probably help you understand things as well:
http://wiki.apache.org/nutch/DistributedWebDB
Hope this helps.
Paul
On Fri, Feb 19, 2010 at 2:03 AM, Amit Agarwal <aagarwa...@gmail.com> wrote:
Hi Paul,
Yes, on first read it looks like a good option. I have downloaded Nutch
from the official download site.
I believe that Fetcher is the class responsible for downloading web
content. I successfully ran the test cases in the TestFetcher class, and
they execute with no issues, but I am unable to locate the downloaded
files. I am not sure what configuration change would make it store the
files in the local filesystem.
I guess my last mail gave the incorrect impression that I want to
replicate Nutch. Instead, I want to use it in a web application (and
probably keep it reusable, so that plugging it into a different web app
is easy). The web app is supposed to allow its users to download any
web page they find interesting and later view those pages by doing a
keyword search over them. Since I couldn't locate any good tutorial or
guide on how to use the Nutch APIs, I thought of sending a mail to the
user group to see if anyone can answer it.
- Amit
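For readers following the thread, the workflow Amit describes (store a page in the local filesystem, index it by keyword, search the cache) can be sketched with plain JDK classes. This is not the Nutch or Lucene API: the class name PageCache, its methods, and the file-naming scheme are assumptions made for illustration, and fetching the page is left out (the text is assumed to have been downloaded already, for example by a crawler).

```java
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
class PageCache {
    private final Path dir;
    // keyword -> URLs of the stored pages that contain it
    private final Map<String, Set<String>> index = new HashMap<>();

    PageCache(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // Saves the page text to a file named after the URL and indexes its words.
    Path store(String url, String text) throws IOException {
        // Assumption: the URL-encoded address is short enough to be a file name.
        String fileName = URLEncoder.encode(url, StandardCharsets.UTF_8) + ".txt";
        Path file = dir.resolve(fileName);
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        // Split on anything that is not a letter or digit; lowercase for matching.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(url);
            }
        }
        return file;
    }

    // Returns the URLs whose stored text contains the keyword, sorted by URL.
    List<String> search(String keyword) {
        Set<String> hits = index.getOrDefault(
                keyword.toLowerCase(Locale.ROOT), Collections.emptySet());
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) throws IOException {
        PageCache cache = new PageCache(Files.createTempDirectory("pagecache"));
        Path saved = cache.store("http://example.com/a", "Nutch is a web crawler");
        System.out.println("Saved to " + saved);
        System.out.println("Hits for 'crawler': " + cache.search("crawler"));
    }
}
```

The index here lives only in memory and matches exact words. A real deployment would hand the stored text to a search library such as Lucene, which persists the index and handles analysis and ranking.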
On Fri, Feb 19, 2010 at 2:36 PM, Paul Dhaliwal <subp...@gmail.com> wrote:
You might want to look at Nutch. I think they just read this email and
implemented it ;)
Paul
On Thu, Feb 18, 2010 at 7:17 PM, Amit Agarwal <aagarwa...@gmail.com> wrote:
Hi,
I am a newbie to Nutch and Lucene. I have a task to build a framework
for webpage caching on the local system (i.e. download and store
webpages in the local filesystem), indexing (index pages on keywords),
and search (search the local webpage cache using the keywords). The
preference would be to build the framework using Java APIs available in
third-party jars.
At first glance, it seems Nutch+Hadoop+Lucene should be a good option
for building this framework. Do you think it is the right option? Any
ideas or links would be appreciated.
Regards,
Amit