Stefan,
  Thanks a bunch! I see what you mean.

matt

Stefan Neufeind wrote:
>Matthew Holt wrote:
>
>>Hi all,
>>  I have already successfully indexed all the files on my domain only (as
>>specified in the conf/crawl-urlfilter.txt file).
>>
>>Now when I use the below script (./recrawl crawl 10 31) to recrawl the
>>domain, it begins indexing pages off of my domain (such as Wikipedia,
>>etc). How do I prevent this? Thanks!
>>
>
>Hi Matt,
>
>have a look at regex-urlfilter. "crawl" is special in some ways.
>Actually it's a "shortcut" for several steps. And it has a special
>urlfilter-file. But if you do it in several steps that urlfilter-file is
>no longer used.
>
>
>Regards,
> Stefan
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
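
For reference, a minimal sketch of what Stefan's suggestion might look like in conf/regex-urlfilter.txt. It assumes that the step-by-step tools called by the recrawl script read that file rather than conf/crawl-urlfilter.txt, and that a catch-all accept rule (+.) at the end of the file is what lets off-domain links through. MY.DOMAIN.NAME is a placeholder, not a value from this thread; rules are tried top to bottom and the first match decides.

```
# conf/regex-urlfilter.txt (sketch; MY.DOMAIN.NAME is a placeholder)

# accept URLs on the placeholder domain and its subdomains
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# reject everything else (replaces a trailing "+." accept-all rule, if present)
-.
```

Any skip rules already in the file (file suffixes, query characters, and so on) would normally stay above the accept line, since an earlier match wins.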
