Hi there,
I'm having some weird problems with my recrawl procedure (Nutch 0.9).
The situation is the following:
First, I crawl a couple of domains. Then I start a separate crawl with pages
resulting from the first crawl, and finally I merge these two crawls.
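
For reference, the commands I'm running look roughly like this (directory and
seed names are simplified placeholders, not my actual setup):

    # initial crawl of the seed domains
    bin/nutch crawl urls1 -dir crawl1 -depth 3
    # second crawl, seeded with pages taken from the first one
    bin/nutch crawl urls2 -dir crawl2 -depth 1
    # merge the two crawls into one
    bin/nutch mergedb merged/crawldb crawl1/crawldb crawl2/crawldb
    bin/nutch mergelinkdb merged/linkdb crawl1/linkdb crawl2/linkdb
    bin/nutch mergesegs merged/segments crawl1/segments/* crawl2/segments/*
    bin/nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/*
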
What I basically want to achieve now is to frequently update (refetch!) the
crawl resulting from the merge procedure, without adding new URLs to it. The
problem is that while executing the recrawl procedure, Nutch fetches/indexes
new URLs I don't want in my crawl. The -noAdditions parameter doesn't help me,
because the crawldb seems to already contain more URLs than are actually
indexed (and these URLs are going to be injected initially).
So my approach was to use regex-urlfilter.txt, but somehow the recrawl
procedure doesn't consider this file (all new URLs are fetched/indexed anyway).
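
For what it's worth, my conf/regex-urlfilter.txt whitelists only my domains
and rejects everything else (example.com/example.org stand in for the real
domains), along these lines:

    # accept URLs from my two domains only
    +^http://([a-z0-9]*\.)*example\.com/
    +^http://([a-z0-9]*\.)*example\.org/
    # reject everything else
    -.
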
The parameters were depth 1 and adddays 1. If somebody knows how to limit the
Nutch recrawl procedure, please let me know.
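
In case it matters, my recrawl script is essentially the one from the Nutch
wiki, i.e. per depth iteration something like:

    # select the URLs due for fetching, shifting 'now' forward by adddays
    bin/nutch generate merged/crawldb merged/segments -adddays $adddays
    segment=`ls -d merged/segments/* | tail -1`
    bin/nutch fetch $segment
    # update the crawldb with the fetch results
    bin/nutch updatedb merged/crawldb $segment

followed by invertlinks, index, dedup and merge at the end.
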
The second problem I faced was with the adddays parameter. A recrawl with depth
0 and adddays 31 doesn't make Nutch refetch the URLs. If I change the depth to
1, I run into the problems described above, but Nutch doesn't refetch the
'original' pages either.
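
For completeness: I haven't changed the refetch interval, so I assume the
default from nutch-default.xml still applies (30 days, if I read it correctly):

    <property>
      <name>db.default.fetch.interval</name>
      <value>30</value>
    </property>
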
So, does anybody know how to solve these problems?
Thanks for your help!
Regards,
Chris