Hello,

Is there a way to tell Nutch to follow only links within the domain it is currently crawling?
What I would like to do is pass in a list of URLs and have Nutch ignore all outbound links to any domain other than the one the link comes from. Say I am crawling www.test1.com; I should then only follow links to www.test1.com.

I realize I can do this with the regex filter *if* I add a regex rule for *each* site I want to crawl, but that solution doesn't scale well for my project. I have also read about a database-backed URL filter that maintains a list of accepted URLs in a database. That doesn't fit well either, since I don't want to maintain both the crawl list and the accepted-domain database. I could, but it is rather clunky.

I have poked around the source, and it looks like the URL filtering mechanism is only passed the link URL and returns a URL. So it appears that this is not really possible at the code level without source modifications. I would just like to confirm that I am not missing anything obvious before I start reworking the code.

Cheers,
Vince
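For what it's worth, the same-domain check itself is trivial; the limitation is only that the filter hook, as far as I can tell, receives the link URL alone with no reference to the page it was found on. A minimal sketch of the check I would want to run, assuming the filter could somehow be handed the source page as well (sameHost is a hypothetical helper, not part of any Nutch API):

```java
import java.net.URL;

public class SameHostCheck {
    // Returns true if 'link' stays on the same host as 'source'.
    // This is exactly the comparison a per-domain filter needs, but it
    // requires the source URL, which the filter mechanism does not provide.
    static boolean sameHost(String source, String link) throws Exception {
        return new URL(source).getHost()
                .equalsIgnoreCase(new URL(link).getHost());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sameHost("http://www.test1.com/a",
                                    "http://www.test1.com/b"));   // prints: true
        System.out.println(sameHost("http://www.test1.com/page",
                                    "http://www.other.com/"));    // prints: false
    }
}
```

So the rework I am contemplating is essentially threading the source URL through to where this comparison can happen.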
