Hi,

It seems you don't want to use the URLs generated from the crawldb itself. In that case the "bin/nutch freegen" command would be helpful to you.
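As a rough sketch (assuming the FreeGenerator tool's basic usage of an input directory of seed-list files plus a segments directory; the "seeds" and "crawl/segments" paths below are just placeholders):

    # put the exact URLs you want refetched into a plain text file, one per line
    mkdir seeds
    echo "http://www.example.com/page1.html" > seeds/urls.txt

    # generate a fetch list directly from those URLs, bypassing the crawldb
    bin/nutch freegen seeds crawl/segments

The segment created under crawl/segments can then be fetched, parsed and updated/indexed in the usual way, without new URLs being pulled in from the crawldb.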
If you are using the "bin/nutch crawl" command, the 'conf/regex-urlfilter.txt' file is not used; the 'conf/crawl-urlfilter.txt' file is used instead (see the sample filter after the quoted message below).

Regards,
Susam Pal

On Jan 10, 2008 6:34 PM, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Hi there,
>
> I'm actually having weird problems with my recrawl procedure (Nutch 0.9).
>
> The situation is the following:
>
> First, I crawl a couple of domains. Then, I start a separate crawl with
> pages resulting from the first crawl and finally merge these two crawls.
>
> What I basically want to achieve now is to frequently update (refetch!) the
> crawl resulting from the merge procedure without adding new URLs to it. The
> problem is that while executing the recrawl procedure, nutch fetches/indexes
> new URLs I don't want to have in my crawl. The -noadditions parameter
> doesn't help me, because the crawldb seems to already contain more URLs than
> are actually indexed (and these URLs are injected initially). So my approach
> was to use the regex-urlfilter.txt, but somehow the recrawl procedure
> doesn't consider this file (all new URLs are fetched/indexed anyway). The
> parameters were depth 1 and adddays 1. If somebody knows how to limit the
> nutch recrawl procedure, please let me know.
>
> The second problem I faced was with the adddays parameter. A recrawl with
> depth 0 and adddays 31 doesn't make nutch refetch the URLs. If I change the
> depth to 1, I face the problems described above, but nutch doesn't refetch
> the 'original' pages either.
>
> So, does anybody know how to solve these problems?
>
> Thanks for your help!
>
> Regards,
> Chris
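Sample crawl-urlfilter.txt (a rough sketch; example.com and example.org stand in for the domains already in your crawl):

    # accept URLs only from the domains already in the crawl
    +^http://([a-z0-9]*\.)*example.com/
    +^http://([a-z0-9]*\.)*example.org/

    # reject everything else
    -.

Rules are applied top to bottom, so the final "-." catches every URL that did not match one of the "+" patterns above it.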
