Matthew Holt wrote:
> Hi all,
>   I have already successfully indexed all the files on my domain only (as
> specified in the conf/crawl-urlfilter.txt file).
> 
> Now when I use the below script (./recrawl crawl 10 31) to recrawl the
> domain, it begins indexing pages off of my domain (such as wikipedia,
> etc). How do I prevent this? Thanks!

Hi Matt,

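Have a look at conf/regex-urlfilter.txt. The "crawl" command is special
in some ways: it is really a shortcut for several steps, and it uses its
own URL filter file (conf/crawl-urlfilter.txt). If you run those steps
individually, that file is no longer used and the filter rules in
regex-urlfilter apply instead, so the domain restriction has to be
repeated there.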
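
As a rough sketch of what that could look like (assuming the regex URL
filter is the one in effect and reads conf/regex-urlfilter.txt; the
domain example.com is only a placeholder for your own), you would carry
over the same kind of rule you already have in crawl-urlfilter.txt.
Rules are tried top to bottom, the first pattern that matches decides,
"+" accepts and "-" rejects. As far as I recall the stock file ends
with a catch-all "+." rule, which would explain the off-domain pages.

```text
# conf/regex-urlfilter.txt (sketch, not a complete file)
# example.com is a placeholder; substitute your own domain.

# accept URLs on the placeholder domain and its subdomains
+^http://([a-z0-9]*\.)*example\.com/

# reject everything else (replaces a trailing catch-all "+." rule)
-.
```

Keep any existing "-" rules you still want (for file suffixes, query
strings and so on) above the "+" line, since the first match wins.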


Regards,
 Stefan


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general