Nutch crawls the web link by link, extracting outlinks from each fetched
page. I'm wondering whether the crawl could also be guided by content.
For example, we could check whether a link's anchor text contains keywords
from a dictionary to decide whether or not to crawl it. We could also check
whether the content of the page fetched from an outlink contains such
keywords.

I think this could be done with a plug-in such as a URL filter, but it
seems likely to hurt crawl performance. So I'd like to hear your opinions:
is it possible, or meaningful, to guide a crawl not just by links but by
content or terms?
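For concreteness, the dictionary check I have in mind follows the shape of Nutch's URL filter contract (return the URL to accept it, null to reject it). This is only a standalone sketch; the class name and dictionary are made up for illustration, and a real Nutch plugin would additionally implement the URLFilter interface and its configuration plumbing:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Standalone sketch of a keyword-based filter, modeled on the accept/reject
 * contract of Nutch URL filters: filter(url) returns the URL to keep it,
 * or null to drop it. KeywordUrlFilter and the dictionary are hypothetical.
 */
public class KeywordUrlFilter {
    private final Set<String> dictionary;

    public KeywordUrlFilter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    /** Keep the URL only if it contains at least one dictionary keyword. */
    public String filter(String url) {
        String lower = url.toLowerCase();
        for (String keyword : dictionary) {
            if (lower.contains(keyword)) {
                return url;  // accept: crawl this outlink
            }
        }
        return null;         // reject: skip this outlink
    }

    public static void main(String[] args) {
        KeywordUrlFilter f = new KeywordUrlFilter(
                new HashSet<>(Arrays.asList("sports", "football")));
        System.out.println(f.filter("http://example.com/sports/news"));
        System.out.println(f.filter("http://example.com/weather"));
    }
}
```

The same contains-a-keyword test could in principle be applied to anchor text or fetched page content instead of the URL string, which is where I worry the per-page cost starts to add up.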
