Re: [Nutch-general] Recrawling question

Matthew Holt Tue, 06 Jun 2006 12:24:56 -0700

Stefan,
  Thanks a bunch! I see what you mean.
matt

Stefan Neufeind wrote:


>Matthew Holt wrote:
>
>>Hi all,
>>  I have already successfully indexed all the files on my domain only (as
>>specified in the conf/crawl-urlfilter.txt file).
>>
>>Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>domain, it begins indexing pages off of my domain (such as wikipedia,
>>etc). How do I prevent this? Thanks!
>>
>
>Hi Matt,
>
>have a look at regex-urlfilter. The "crawl" command is special in some
>ways: it is a shortcut for several steps, and it has its own special
>urlfilter file. If you run those steps separately, that urlfilter file
>is no longer used.
>
>
>Regards,
> Stefan
>
>  
>
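
A minimal sketch of what Stefan's hint amounts to, assuming the individual
steps of the recrawl script read conf/regex-urlfilter.txt rather than
conf/crawl-urlfilter.txt. The domain example.com is a placeholder, not taken
from this thread; the rule format mirrors the stock crawl-urlfilter.txt, where
each line is a regex prefixed with "+" (accept) or "-" (reject) and the first
matching rule decides.

```
# conf/regex-urlfilter.txt
# (assumed to be the filter file used when the crawl steps run separately)

# accept URLs on the crawled domain; example.com is a placeholder
+^http://([a-z0-9]*\.)*example.com/

# skip everything else
-.
```

The stock regex-urlfilter.txt, as far as can be told from the default
distribution, ends with a catch-all "+." rule that accepts any URL, which
would explain pages from other sites being fetched. Replacing that final rule
with "-." as above should keep the recrawl on the one domain.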


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
