Matthew Holt wrote:
> Hi all,
> I have already successfully indexed all the files on my domain only (as
> specified in the conf/crawl-urlfilter.txt file).
>
> Now when I use the script below (./recrawl crawl 10 31) to recrawl the
> domain, it begins indexing pages outside my domain (such as Wikipedia,
> etc.). How do I prevent this? Thanks!

Hi Matt,

have a look at regex-urlfilter. The "crawl" command is special in some ways: it is actually a shortcut for several steps, and it has its own URL filter file (conf/crawl-urlfilter.txt). If you run those steps individually, as a recrawl script does, that URL filter file is no longer used, so the domain restriction has to be set in regex-urlfilter as well.

Regards,
Stefan

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
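
To illustrate the point in the reply, here is a minimal Python sketch of how "+regex" / "-regex" filter rules of this kind are typically evaluated. It assumes that the first rule whose pattern is found in the URL decides the outcome and that a URL matching no rule is dropped. The domain example.org, the RULES text, and the helper names parse_rules and accepts are hypothetical and are not taken from Nutch itself.

```python
import re

# Hypothetical rules in the "+regex" / "-regex" style of a URL filter file.
# example.org stands in for your own domain.
RULES = r"""
# accept URLs on example.org and its subdomains
+^http://([a-z0-9]*\.)*example\.org/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn the rule text into a list of (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # no rule matched: treat the URL as filtered out


rules = parse_rules(RULES)
on_domain = accepts(rules, "http://www.example.org/page.html")
off_domain = accepts(rules, "http://en.wikipedia.org/wiki/Nutch")
print(on_domain, off_domain)
```

With rules like these, only URLs on the chosen domain pass. If the last rule were "+." instead of "-.", every off-domain link would be accepted, which would match the behaviour described in the question. It is worth checking how the final rule in your own regex-urlfilter file is set.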
