Hi,

It seems you don't want to use the URLs generated from the crawl db
itself. In that case, the "bin/nutch freegen" command would be helpful
to you.
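
As a rough sketch (the paths here are placeholders, and you should run
"bin/nutch freegen" with no arguments to confirm the exact syntax in
your version), freegen builds a fetch list directly from text files of
URLs instead of from the crawldb:

    # seed_urls/ is a directory of text files, one URL per line (hypothetical path)
    # crawl/segments is the segments directory of your existing crawl
    bin/nutch freegen seed_urls crawl/segments

You could then fetch, update and index that segment as usual, so only
the URLs you listed get refetched.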

If you are using the "bin/nutch crawl" command, the
'conf/regex-urlfilter.txt' file wouldn't be used. The
'conf/crawl-urlfilter.txt' file would be used instead.
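
For reference, both filter files use the same format: one regex per
line, prefixed with '+' (accept) or '-' (reject), and the first
matching rule wins. A minimal sketch that limits the crawl to a single
placeholder domain:

    # conf/crawl-urlfilter.txt -- example.com is a placeholder domain
    # accept URLs on example.com and its subdomains
    +^http://([a-z0-9]*\.)*example.com/
    # reject everything else
    -.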

Regards,
Susam Pal

On Jan 10, 2008 6:34 PM, [EMAIL PROTECTED] wrote:
> Hi there,
>
> I'm actually having weird problems with my recrawl procedure (Nutch 0.9).
>
> The situation is the following:
>
> First, I crawl a couple of domains. Then I start a separate crawl with the
> pages resulting from the first crawl and finally merge these two crawls.
>
> What I basically want to achieve now is to frequently update (refetch!) the
> crawl resulting from the merge procedure without adding new URLs to it. The
> problem is that while executing the recrawl procedure, Nutch is
> fetching/indexing new URLs I don't want to have in my crawl. The -noAdditions
> parameter doesn't help me, because the crawldb seems to already contain more
> URLs than are actually indexed (and these URLs get injected initially). So my
> approach was to use regex-urlfilter.txt, but somehow the recrawl procedure
> doesn't consider this file (all new URLs are fetched/indexed anyway). The
> parameters were depth 1 and adddays 1. If somebody knows how to limit the
> Nutch recrawl procedure, please let me know.
>
> The second problem I faced was with the adddays parameter. A recrawl with
> depth 0 and adddays 31 doesn't make Nutch refetch the URLs. If I change the
> depth to 1, I face the problems described above, but Nutch doesn't refetch
> the 'original' pages either.
>
> So, does anybody know how to solve these problems?
>
> Thanks for your help!
>
> Regards,
> Chris
>