When you run the separate generate and fetch commands, are you also
running an updatedb command after each fetch, and then repeating the
generate and fetch cycle several times? The -depth 3 parameter automates
this in the crawl command.
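
A rough sketch of the cycle I mean is below. It is not the wiki script, and the function name run_cycles, the NUTCH variable, and the directory layout (a crawl directory containing crawldb and segments) are assumptions made for illustration. The generate, fetch, and updatedb subcommands are the standard ones; check the argument order against the usage output of your Nutch version.

```shell
#!/bin/sh
# Sketch only: the paths and the helper name are assumptions, not taken
# from the original scripts.
NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher (assumed)

# run_cycles CRAWL_DIR DEPTH THREADS
# Repeats generate -> fetch -> updatedb DEPTH times, so that links found
# in one round become fetch candidates in the next round.
run_cycles() {
    crawl_dir=$1
    depth=$2
    threads=$3
    i=1
    while [ "$i" -le "$depth" ]; do
        echo "cycle $i of $depth"
        # Select the URLs that are due for fetching into a new segment.
        "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
        # Segment names are timestamps, so the last one listed is the newest.
        segment=$(ls -d "$crawl_dir"/segments/* | tail -1)
        "$NUTCH" fetch "$segment" -threads "$threads" || return 1
        # Feed fetch results and newly discovered links back into the crawldb.
        "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment" || return 1
        i=$((i + 1))
    done
}
```

After injecting the seed list (bin/nutch inject crawl/crawldb urls), calling run_cycles crawl 3 200 would roughly correspond to the -depth 3 -threads 200 run quoted below. If updatedb is skipped, or the loop runs only once, the crawldb never learns about the links found during fetching, which would be consistent with ending up with only a handful of documents.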
Dennis
José Mestre wrote:
Hi,
I'm using nutch to index part of an intranet website.
When I use the "crawl" command the database indexes 3000 documents:
e.g.: nutch crawl urls -dir crawl -threads 200 -depth 3
But when I do the same with the separate "generate, fetch, ..." commands I get
only 50 documents in the database:
e.g.: the crawl or recrawl script with adddays=31
http://wiki.apache.org/nutch/Crawl
http://wiki.apache.org/nutch/IntranetRecrawl
I've tried using fetch with the -noAdditions option.
Does anyone know why this happens?
I think 'crawl-urlfilter.txt' and 'regex-urlfilter.txt' are OK.
Regards.
Jo