Hi there,
I'm having some weird problems with my recrawl procedure (Nutch 0.9).
The situation is the following:
First, I crawl a couple of domains. Then I start a separate crawl with pages
resulting from the first crawl, and finally I merge these two crawls.
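
For reference, the commands I'm running look roughly like this (directory and
seed names are simplified placeholders, not my actual setup):

    # initial crawl of the seed domains
    bin/nutch crawl urls1 -dir crawl1 -depth 3
    # second crawl, seeded with pages taken from the first one
    bin/nutch crawl urls2 -dir crawl2 -depth 1
    # merge the two crawls into one
    bin/nutch mergedb merged/crawldb crawl1/crawldb crawl2/crawldb
    bin/nutch mergelinkdb merged/linkdb crawl1/linkdb crawl2/linkdb
    bin/nutch mergesegs merged/segments crawl1/segments/* crawl2/segments/*
    bin/nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/*
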
What I basically want to achieve now is to frequently update (refetch!) the
crawl resulting from the merge procedure, without adding new URLs to it. The
problem is that while executing the recrawl procedure, Nutch fetches/indexes
new URLs I don't want in my crawl. The -noAdditions parameter doesn't help me,
because the crawldb seems to already contain more URLs than are actually
indexed (and these URLs are going to be injected initially).
So my approach was to use regex-urlfilter.txt, but somehow the recrawl
procedure doesn't consider this file (all new URLs are fetched/indexed anyway).
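
For what it's worth, my conf/regex-urlfilter.txt whitelists only my domains
and rejects everything else (example.com/example.org stand in for the real
domains), along these lines:

    # accept URLs from my two domains only
    +^http://([a-z0-9]*\.)*example\.com/
    +^http://([a-z0-9]*\.)*example\.org/
    # reject everything else
    -.
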
The parameters were depth 1 and adddays 1. If somebody knows how to limit the
Nutch recrawl procedure, please let me know.
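
In case it matters, my recrawl script is essentially the one from the Nutch
wiki, i.e. per depth iteration something like:

    # select the URLs due for fetching, shifting 'now' forward by adddays
    bin/nutch generate merged/crawldb merged/segments -adddays $adddays
    segment=`ls -d merged/segments/* | tail -1`
    bin/nutch fetch $segment
    # update the crawldb with the fetch results
    bin/nutch updatedb merged/crawldb $segment

followed by invertlinks, index, dedup and merge at the end.
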
The second problem I faced was with the adddays parameter. A recrawl with depth
0 and adddays 31 doesn't make Nutch refetch the URLs. If I change the depth to
1, I run into the problems described above, but Nutch doesn't refetch the
'original' pages either.
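
For completeness: I haven't changed the refetch interval, so I assume the
default from nutch-default.xml still applies (30 days, if I read it correctly):

    <property>
      <name>db.default.fetch.interval</name>
      <value>30</value>
    </property>
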
So, does anybody know how to solve these problems?
Thanks for your help!
Regards,
Chris