Lukas Vlcek wrote:
> Hi again,
> 
> On 6/8/06, Mehdi Hemani <[EMAIL PROTECTED]> wrote:
>> 1. I want to filter out webpages based on a list of words. I have
>> tried filtering webpages based on url, but how to do it based on
>> words?
> 
> As for this question check the following link:
> http://wiki.apache.org/nutch/CommandLineOptions
> 
> As far as I know this prune tool should be available for nutch 0.8 as
> well (at least I can see the class to be included in source code so
> you should be able to call it).

Pruning with 0.8-dev works fine here. You give it a file with your
"queries", and all matching pages are pruned from the index. There is
also a dry-run option available - use that when building your queries :-)
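
For a plain word list, the queries file can be generated rather than typed by hand. Below is a rough sketch in Python, not something shipped with Nutch. It assumes the file holds one Lucene-style query per line and that the page text is indexed in a field called "content"; the function names and the file name prune-queries.txt are made up for illustration. Check the prune tool's usage output for the exact options in your version.

```python
from pathlib import Path


def build_prune_queries(words, field="content"):
    """Return one Lucene-style query line per non-empty word.

    The field name "content" is an assumption about the index schema.
    """
    lines = []
    for word in words:
        term = word.strip()
        if not term:
            continue
        # Quote multi-word entries so they are matched as a phrase.
        if " " in term:
            term = f'"{term}"'
        lines.append(f"{field}:{term}")
    return lines


def write_queries_file(words, path="prune-queries.txt"):
    """Write the generated queries to a text file, one per line."""
    lines = build_prune_queries(words)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    # Hypothetical word list; replace with the words you want to filter on.
    for line in write_queries_file(["casino", "free ringtones"]):
        print(line)
```

Run the prune tool in dry-run mode against the generated file first, and look at what would be removed before pruning for real.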

Note that documents are only pruned from the index, not from the segments
or the crawldb! So after re-indexing or running another crawl round, be
sure to apply pruning again.

   Stefan


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
