Re: Filtering webpages based on words / Fetch progress

Stefan Neufeind Thu, 08 Jun 2006 16:29:04 -0700

Lukas Vlcek wrote:

Hi again,


On 6/8/06, Mehdi Hemani <[EMAIL PROTECTED]> wrote:

1. I want to filter out webpages based on a list of words. I have
tried filtering webpages based on url, but how to do it based on
words?


As for this question check the following link:
http://wiki.apache.org/nutch/CommandLineOptions

As far as I know this prune tool should be available for nutch 0.8 as
well (at least I can see the class to be included in source code so
you should be able to call it).

Pruning with 0.8-dev works fine here. You give it a file with your "queries" and all matching pages will be pruned from the index. There is also a dry-run option available - use that when building your queries :-)

Note that documents are only pruned from the index, not from segments or the crawldb! So upon re-indexing or running another crawler round, be sure to apply pruning again.
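
To make the dry-run idea concrete, here is a rough sketch in Python. It is only a toy model and not the Nutch prune tool or its API: the prune() function, the dict used as an "index" and the example URLs are all made up for illustration, and a "query" is assumed to be a single word. A dry run reports what would match and leaves the index alone; a real run drops the matching documents.

```python
# Toy model only: this is NOT the Nutch prune tool or its API.
# The function name, the dict-based "index" and the URLs are hypothetical.
def prune(index, queries, dry_run=True):
    """Return (matching doc ids, resulting index).

    index   -- dict mapping a document id (here a URL) to its text
    queries -- list of words; a document matches if it contains any of them
    dry_run -- when True, report matches but leave the index untouched
    """
    lowered = [q.lower() for q in queries]
    matches = sorted(
        doc_id
        for doc_id, text in index.items()
        if any(q in text.lower().split() for q in lowered)
    )
    if dry_run:
        # Nothing is removed; the caller only learns what would be pruned.
        return matches, index
    kept = {doc_id: text for doc_id, text in index.items() if doc_id not in matches}
    return matches, kept


index = {
    "http://example.com/a": "cheap casino offers",
    "http://example.com/b": "apache nutch tutorial",
}
queries = ["casino"]

# First check what the queries would hit, then prune for real.
matches, after_dry = prune(index, queries, dry_run=True)
_, after_real = prune(index, queries, dry_run=False)
```

Since only the index is touched, a rebuilt index would contain the pruned pages again until the same queries are applied once more.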


  Stefan
