Hello, I would like to restrict a crawl to the domain specified in a seed URL without using the urlfilter-regex plugin. The db.ignore.external.links property looked like it would do the trick, but I've found that links which redirect outside the seed URL's host still get through. For example, if I start at http://www.xyz.com and Nutch finds a link pointing to http://www.xyz.com/blog, which is actually a redirect to http://blog.xyz.com, then Nutch will start fetching pages from http://blog.xyz.com even though it was not in the seed URL file. Is this the intended behavior for the db.ignore.external.links property? If so, is there a way to restrict a crawl to a particular site without the regex filter? If not, would it be useful to create a patch that checks the toUrl hosts against the hosts specified in the original seed list?
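
To make the idea concrete, here is a rough sketch of the kind of host check such a patch might perform. It is plain Java and does not use any Nutch API; the class name SeedHostCheck and the methods hostOf and isAllowed are made up for illustration, and where a check like this would hook into Nutch's redirect handling is left open.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: remembers the hosts of the
// seed URLs and tests whether a target URL stays on one of them.
final class SeedHostCheck {
    private final Set<String> seedHosts = new HashSet<>();

    SeedHostCheck(List<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                seedHosts.add(host);
            }
        }
    }

    // Returns the lower-cased host of a URL, or null if it cannot be parsed
    // or has no host component.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // True only if the target URL's host exactly matches a seed host.
    // Intended to be applied to the redirect target (the toUrl), not just
    // to the link as it appears on the page.
    boolean isAllowed(String toUrl) {
        String host = hostOf(toUrl);
        return host != null && seedHosts.contains(host);
    }
}
```

With http://www.xyz.com as the only seed, isAllowed("http://www.xyz.com/blog") would pass, while the redirect target http://blog.xyz.com would be rejected because the comparison is an exact host match. Matching on the registered domain instead (xyz.com) would be a different design choice.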
Thanks, Drew
