Nutch crawls web pages by following links, extracting outlinks from each fetched page. For example, we could check whether a link's anchor text contains keywords from a dictionary to decide whether or not to crawl it. We could also check whether the content of the page fetched via an outlink contains keywords from the dictionary.
I think this could be done with a plugin such as a URL filter, but it seems likely to hurt the performance of the crawling process, so I'd like to hear your opinions. Is it possible, or meaningful, to guide the crawl not just by links but also by page contents or terms?
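To make the idea concrete, here is a minimal sketch of the anchor-text keyword check I have in mind. This is illustrative only: it does not use the actual Nutch `URLFilter` plugin API, and the class name and dictionary are my own assumptions.

```java
import java.util.Set;

// Illustrative sketch (not the real Nutch plugin interface): decide whether
// to follow an outlink based on its anchor text matching a keyword dictionary.
public class KeywordLinkFilter {
    private final Set<String> dictionary;

    public KeywordLinkFilter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Returns true if the anchor text contains at least one dictionary keyword
    // (case-insensitive substring match).
    public boolean shouldCrawl(String anchorText) {
        if (anchorText == null) {
            return false;
        }
        String lower = anchorText.toLowerCase();
        for (String keyword : dictionary) {
            if (lower.contains(keyword.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        KeywordLinkFilter filter =
            new KeywordLinkFilter(Set.of("nutch", "crawler"));
        System.out.println(filter.shouldCrawl("Apache Nutch tutorial"));
        System.out.println(filter.shouldCrawl("cooking recipes"));
    }
}
```

The same kind of check would run once per outlink, which is why I worry about the per-link overhead during a large crawl.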
