Lukas Vlcek wrote:
Hi again,

On 6/8/06, Mehdi Hemani <[EMAIL PROTECTED]> wrote:
1. I want to filter out webpages based on a list of words. I have
tried filtering webpages based on URL, but how do I do it based on
words?

For this question, check the following link:
http://wiki.apache.org/nutch/CommandLineOptions

As far as I know, the prune tool should be available in Nutch 0.8 as
well (at least I can see the class included in the source code, so
you should be able to call it).

Pruning with 0.8-dev works fine here. You give it a file containing your "queries", and all pages matching them are pruned from the index. There is also a dry-run option available - use that while building your queries :-)
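
To make the "file with your queries" part concrete, here is a minimal sketch in Python that turns a list of words into such a file. It rests on assumptions I have not verified against the tool itself: that the file holds one Lucene-style query per line, and that the page text is indexed under a field called "content". The function names and the default field are made up for illustration, and special query characters are not escaped.

```python
def build_prune_queries(words, field="content"):
    """Return one Lucene-style query string per non-empty word or phrase.

    Assumption: the prune tool reads one query per line, and `field`
    names the index field holding the page text (hypothetical default).
    """
    queries = []
    for word in words:
        term = word.strip()
        if not term:
            continue  # skip blank entries
        if any(ch.isspace() for ch in term):
            term = '"' + term + '"'  # quote multi-word phrases
        queries.append(f"{field}:{term}")
    return queries


def write_queries_file(words, path, field="content"):
    """Write the generated queries to `path`, one per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for query in build_prune_queries(words, field):
            fh.write(query + "\n")
```

Calling write_queries_file(["casino", "free money"], "queries.txt") would produce a two-line file that you can then hand to the prune tool, ideally with the dry-run option first.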

Note that documents are only pruned from the index, not from the segments or the crawldb! So after re-indexing or running another crawl round, be sure to apply pruning again.

  Stefan
