Lukas Vlcek wrote:
> Hi again,
>
> On 6/8/06, Mehdi Hemani <[EMAIL PROTECTED]> wrote:
>> 1. I want to filter out webpages based on a list of words. I have
>> tried filtering webpages based on url, but how to do it based on
>> words?
>
> As for this question check the following link:
> http://wiki.apache.org/nutch/CommandLineOptions
>
> As far as I know this prune tool should be available for nutch 0.8 as
> well (at least I can see the class to be included in source code so
> you should be able to call it).
Pruning with 0.8-dev works fine here. You give it a file with your
"queries", and all matching pages are pruned from the index. There is
also a dry-run option available; use it while building your queries :-)

Note that documents are pruned only from the index, not from the
segments or the crawldb! So after re-indexing or running another crawl
round, be sure to apply pruning again.

Stefan
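
For the original question (filtering by a list of words), the queries file could be generated from the word list. The following is only a rough sketch, not part of Nutch: the helper name `write_prune_queries` is made up, and it assumes the prune tool reads one Lucene-style query per line and that page text is indexed under a field named `content`. Check both against your own setup, and run with the dry-run option first.

```python
from pathlib import Path


def write_prune_queries(words, path, field="content"):
    """Write one Lucene-style query per line and return the lines written.

    Assumptions (not confirmed by this thread): the prune tool accepts
    one query per line, and `field` is the index field holding page text.
    """
    lines = []
    for word in words:
        term = word.strip()
        if not term:
            # Skip blank entries in the word list.
            continue
        if " " in term:
            # Quote multi-word entries so they are matched as a phrase.
            term = f'"{term}"'
        lines.append(f"{field}:{term}")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    # Hypothetical word list and output file name.
    for query in write_prune_queries(["casino", "free money"], "prune-queries.txt"):
        print(query)
```

Since pruning touches only the index, the same file can be kept and reapplied after each re-index.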
