Continuing the noble tradition of replying to my own messages, I have
a small update on the crawler escaping the given list of hosts in
spite of db.ignore.external.links=true...

2006/10/25, Tomi NA <[EMAIL PROTECTED]>:

> Could you give an example of a root URL, which leads to this symptom
> (i.e. leaks outside the original site)?

I'll try to find out exactly where the crawler starts to run loose as
I have several web sites in my initial URL list.

I'm using Nutch 0.9 now and have run into the problem again. It's a
bit hard to reproduce as I have dozens of hosts in my initial URL list
and the crawler leaves them days after I start the crawl: it's very
difficult to pinpoint how or why the crawler steps outside its
bounds.

Has anyone else run into this problem?
Is there anything else I need to set up besides
db.ignore.external.links=true?
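
For reference, this is roughly what I have in conf/nutch-site.xml
(just a sketch of my own setup, not a recommended configuration),
plus the kind of host-restricting rule I was considering adding to
conf/crawl-urlfilter.txt as a second line of defence (example.com
stands in for one of my actual hosts):

  <!-- conf/nutch-site.xml: overrides the default in nutch-default.xml -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Only follow links that stay on the host the page
    came from.</description>
  </property>

  # conf/crawl-urlfilter.txt: accept only URLs on my seed hosts...
  +^http://([a-z0-9-]+\.)*example\.com/
  # ...and drop everything else
  -.

If the URL filter really is required on top of
db.ignore.external.links, that might explain the leak, but I haven't
confirmed that yet.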

TIA,
t.n.a.
