As far as I know, the crawl command (called "Intranet crawling" in the
tutorial) assumes you refetch everything from scratch every time you run it.
Whole Web crawling gives you more detailed control over what you crawl and
recrawl, but some parameters might not work as I would expect (e.g.
-refetchonly). Support for checking whether a page has been modified since
its last fetch is currently missing (although, as I understand it, there is
some work going on in this direction:
http://issues.apache.org/jira/browse/NUTCH-61).
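
To make the recrawl idea concrete, here is a rough Python sketch of interval-based refetch scheduling, the kind of decision that a fetch interval of 3 or 30 days implies. This is not Nutch's code: the function name, its parameters and the dates are all hypothetical, made up only for illustration.

```python
from datetime import datetime, timedelta


def is_due_for_refetch(last_fetch, interval_days, now):
    """Return True when at least interval_days have passed since last_fetch.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    return now >= last_fetch + timedelta(days=interval_days)


if __name__ == "__main__":
    # Illustrative dates: the page was fetched 5 days before "now".
    now = datetime(2005, 9, 10)
    last_fetch = datetime(2005, 9, 5)

    print(is_due_for_refetch(last_fetch, 3, now))   # True: 5 days >= 3 days
    print(is_due_for_refetch(last_fetch, 30, now))  # False: 5 days < 30 days
```

Note that a check like this only decides when a page is fetched again. It says nothing about whether the page has changed, which is the part that is currently missing.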
Regards
Piotr
[EMAIL PROTECTED] wrote:
Hello,
I have a newbie question:
I have launched and completed an intranet crawl
(bin/nutch crawl mySite myDB).
Since I would like to recrawl in a few days, I changed the default Nutch
refetch interval to 3 days (instead of 30).
How do I perform the recrawl? Do I just launch a new intranet crawl using the same parameters?
If I do, will the fetch download only new or modified pages, or will it download everything again?
Thanks for any help
Isabelle
[EMAIL PROTECTED]
Ph: 651 687 3424