Hi All,
 
I've recently started using Nutch and so far, so good : ) I have a few 
questions about the functionality Nutch provides and was wondering if any of 
you could help.
 
1. I have run a couple of whole-web crawling tests with a few URLs and noticed 
that the actual pages/files are not saved or downloaded.  I understand that the 
content of the crawled pages is parsed and extracted.  Is there any way to 
configure Nutch so that it keeps the downloaded files as well as carrying out the 
parsing/information extraction?
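 
For example, I spotted a fetcher.store.content property in nutch-default.xml that 
sounds relevant, and I was wondering whether overriding it in conf/nutch-site.xml 
along these lines is the right approach (this is just a guess on my part):
 
  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
    <!-- my assumption: keep the raw fetched content in the segments -->
  </property>
 
And if the raw content does end up in the segments, is 
"bin/nutch readseg -dump <segment_dir> <output_dir>" the way to get the original 
pages back out, or am I misunderstanding how the segments are stored?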
 
2. I want to use Nutch to run an initial web crawl and then, after a certain 
interval (days or weeks), re-run the crawl, but this time log new pages 
added since the last crawl, pages that have been deleted or removed since the 
last crawl, and changes to existing pages in the database. Does Nutch provide 
such functionality? If so, does anyone have any pointers, or is there any 
existing documentation that would help me get started?
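 
If nothing does this out of the box, would dumping the crawldb after each crawl 
and diffing the dumps be a reasonable workaround? Something along these lines is 
what I had in mind (I may well have the command syntax wrong, and the paths are 
just placeholders):
 
  bin/nutch readdb crawl/crawldb -dump dump_after_crawl_1
  # ... re-crawl some days/weeks later ...
  bin/nutch readdb crawl/crawldb -dump dump_after_crawl_2
  diff -r dump_after_crawl_1 dump_after_crawl_2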
 
3. I also want to crawl only specific content types. For example, can Nutch be 
configured so that it crawls only PDF files or XML files from a web site 
instead of everything on the site?
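 
Would something like the following in conf/regex-urlfilter.txt do it, assuming 
I've understood the filter syntax (first matching rule wins)? And would it also 
stop the crawler from following the HTML pages that link to the PDFs?
 
  # accept urls ending in .pdf (my attempt -- untested)
  +\.pdf$
  # reject everything else
  -.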
 
4. I understand that Nutch uses Lucene for its indexing requirements. Is it 
possible to crawl pages using Nutch and then implement a separate search 
strategy using Lucene? Is it relatively straightforward to hook up the two?
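 
What I have in mind is something along these lines, pointing Lucene straight at 
the index directory that Nutch produces. The index location ("crawl/index") and 
the field names ("content", "url", "title") are guesses on my part, and I'm not 
sure which Lucene version Nutch bundles, so the exact constructors may differ:
 
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
 
  public class NutchIndexSearch {
      public static void main(String[] args) throws Exception {
          // "crawl/index" is where I *think* Nutch writes its Lucene index
          IndexSearcher searcher = new IndexSearcher("crawl/index");
 
          // "content" as the default search field is a guess at Nutch's schema
          QueryParser parser = new QueryParser("content", new StandardAnalyzer());
          Query query = parser.parse("nutch");
 
          Hits hits = searcher.search(query);
          for (int i = 0; i < hits.length(); i++) {
              Document doc = hits.doc(i);
              System.out.println(doc.get("url") + " : " + doc.get("title"));
          }
          searcher.close();
      }
  }
 
Is that roughly the right idea, or is there a Nutch-specific API I should be 
going through instead?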
 
Any help with any of the questions above would be much appreciated.
 
Thanks in advance,
 
Karen
