Lukas Vlcek wrote:
Hi again,
On 6/8/06, Mehdi Hemani <[EMAIL PROTECTED]> wrote:
1. I want to filter out webpages based on a list of words. I have
tried filtering webpages by URL, but how can I do it based on
words?
As for this question, check the following link:
http://wiki.apache.org/nutch/CommandLineOptions
As far as I know, this prune tool should be available in Nutch 0.8 as
well (at least I can see that the class is included in the source code,
so you should be able to call it).
Pruning with 0.8-dev works fine here. You give it a file with your
"queries", and all matching pages are pruned from the index. There is
also a dryrun option available; use that when building your queries :-)
Note that documents are only pruned from the index, not from the segments
or the crawldb! So after re-indexing or running another crawl round, be
sure to apply pruning again.
Stefan