Don't have the answer, but got a question. Does this happen only when redirects to an external host are involved?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Drew Hite <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, June 16, 2008 1:09:34 PM
> Subject: db.ignore.external.links=true and redirects
>
> Hello,
> I would like to restrict a crawl to a domain specified in a seed url without
> using the urlfilter-regex plugin. The db.ignore.external.links property
> looked like it would do the trick, but I've found that links that are
> redirected outside the seed url get through. For example, if I start at
> http://www.xyz.com and Nutch finds a link pointing to
> http://www.xyz.com/blog which is actually a redirection to
> http://blog.xyz.com then Nutch will start fetching pages from
> http://blog.xyz.com even though it was not in the seed url file. Is this the
> intended behavior for the db.ignore.external.links property? If so, is
> there a way to restrict a crawl to a particular site without the regex
> filter? If not, would it be useful to create a patch to check the toUrl
> hosts against the hosts specified in the original seed list?
>
> Thanks,
> Drew
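
For what it's worth, the check Drew proposes (comparing the host of a redirect target against the hosts in the seed list) could be sketched roughly as below. This is not Nutch code and does not use any Nutch API; it is a language-neutral illustration in Python. The function names `host_of`, `seed_hosts`, and `allow_redirect` are hypothetical, and the sketch assumes an exact host match is what's wanted, so `blog.xyz.com` counts as external to `www.xyz.com`.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def seed_hosts(seed_urls):
    """Collect the set of hosts that appear in the seed list."""
    return {host_of(u) for u in seed_urls if host_of(u)}


def allow_redirect(to_url, allowed_hosts):
    """Accept a redirect target only if its host is one of the seed hosts.

    Assumption: exact host comparison, no domain-suffix matching.
    """
    return host_of(to_url) in allowed_hosts


# Example from the message above: the seed is www.xyz.com, and
# /blog redirects to a different host, blog.xyz.com.
allowed = seed_hosts(["http://www.xyz.com"])
same_host = allow_redirect("http://www.xyz.com/blog", allowed)
other_host = allow_redirect("http://blog.xyz.com", allowed)
```

With the seed above, `same_host` is true and `other_host` is false, so the redirect to blog.xyz.com would be dropped. Whether subdomains of a seed host should count as internal is a separate design choice that this sketch does not settle.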
