Continuing the noble tradition of replying to my own messages, I have a small update on the crawler straying outside the given list of hosts in spite of db.ignore.external.links=true...
2006/10/25, Tomi NA <[EMAIL PROTECTED]>:
> Could you give an example of a root URL, which leads to this symptom
> (i.e. leaks outside the original site)?

I'll try to find out exactly where the crawler starts to run loose as I have several web sites in my initial URL list.
I'm using nutch 0.9 now and have run into the problem again. It's a bit hard to reproduce: I have dozens of hosts in my initial URL list and the crawler only leaves them days after I start the crawl, so it's very difficult to pinpoint how or why it steps outside its bounds.

Did anyone else run into such a problem? Is there anything else I need to set up besides db.ignore.external.links=true?

TIA,
t.n.a.
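
P.S. In case it helps anyone reproduce this: the property is set the usual way, as an override in conf/nutch-site.xml. A minimal sketch of that override (the comment is my own wording, not the description text from nutch-default.xml):

  <?xml version="1.0"?>
  <configuration>
    <!-- If true, outlinks pointing to a host other than the one the page
         was fetched from are ignored, which should keep the crawl on the
         initially injected hosts. -->
    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>
  </configuration>

The only other way I know of to keep the crawl on a fixed set of hosts is through the URL filters (conf/crawl-urlfilter.txt for the one-shot crawl command, conf/regex-urlfilter.txt for the step-by-step tools, if I understand the setup correctly), but maintaining a pattern per host for dozens of sites is exactly what I was hoping this property would let me avoid.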
