[Nutch Wiki] Update of "bin/nutch prune" by RobPettengill

Apache Wiki Sat, 16 Jul 2005 17:45:10 -0700

The following page has been changed by RobPettengill:
http://wiki.apache.org/nutch/bin/nutch_prune

New page:
prune is an alias for net.nutch.tools.!PruneIndexTool

This tool prunes existing Nutch indexes of unwanted content. The main method 
accepts a list of segment directories (containing indexes). These indexes will 
be pruned of any content that matches one or more queries from a list of Lucene 
queries read from a file (defined in the standard config file, or explicitly 
overridden from the command line). Segments should already be indexed; any 
segments that are missing indexes will be skipped.

NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so knowledge of 
the available Lucene document fields is required. This can be obtained by 
reading the sources of the index-basic and index-more plugins, or by using 
tools like Luke. During query parsing a WhitespaceAnalyzer is used; this choice 
was made to minimize side effects of the Analyzer on the final set of query 
terms. You can use the net.nutch.searcher.Query.main(String[]) method to 
translate queries in Nutch syntax to queries in Lucene syntax.

If an additional level of control is required, an instance of PruneChecker can 
be provided to check each document before it is deleted. The results of all 
checkers are logically AND-ed, which means that any checker in the chain can 
veto the deletion of the current document. Two example checker implementations 
are provided: PrintFieldsChecker prints the values of selected index fields, 
and StoreUrlsChecker stores the URLs of deleted documents to a file. Either of 
them can be activated by providing the respective command-line options.
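
The AND-ed checker chain described above can be illustrated with a minimal sketch. Everything in it is hypothetical: DocChecker, allAgree and the map-of-fields document are simplified stand-ins invented for this example, not the real PruneChecker API, and the example URLs are made up. It only shows how a single veto keeps a document from being deleted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckerChainSketch {

    /** Hypothetical stand-in for a checker; NOT the real PruneChecker API. */
    interface DocChecker {
        /** Returns true if this checker allows the document to be deleted. */
        boolean isPrunable(Map<String, String> docFields);
    }

    /**
     * Logical AND over all checkers: the document may be deleted only if
     * every checker agrees. The loop stops at the first veto.
     */
    static boolean allAgree(List<DocChecker> checkers, Map<String, String> docFields) {
        for (DocChecker checker : checkers) {
            if (!checker.isPrunable(docFields)) {
                return false; // one veto is enough to keep the document
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<DocChecker> checkers = new ArrayList<>();

        // Assumed behaviour of a "print fields" style checker: report, never veto.
        checkers.add(doc -> {
            System.out.println("candidate: " + doc.get("url"));
            return true;
        });

        // Hypothetical veto rule: never delete anything under /keep/.
        checkers.add(doc -> !doc.get("url").contains("/keep/"));

        Map<String, String> protectedDoc = new HashMap<>();
        protectedDoc.put("url", "http://example.com/keep/a");

        Map<String, String> unwantedDoc = new HashMap<>();
        unwantedDoc.put("url", "http://example.com/spam/b");

        System.out.println("delete protected? " + allAgree(checkers, protectedDoc));
        System.out.println("delete unwanted?  " + allAgree(checkers, unwantedDoc));
    }
}
```

Because the loop stops at the first veto, checkers that have side effects (printing, writing to a file) would only run for documents that earlier checkers in the chain have already approved. Whether the real tool evaluates checkers this way is not stated here.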


Typical Usage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -dryrun 
-queries queries.txt -showfields url,title[[BR]]
This command only prints the selected fields of matching documents; because of 
-dryrun, nothing is deleted.

Typical Usage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -queries 
queries.txt[[BR]]
This command will actually remove all matching entries, according to the 
queries read from the queries.txt file.
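
As a rough illustration of what a queries file might contain, the sketch below reads one query per line from an in-memory sample. The one-query-per-line layout, the skipping of blank lines and '#' comment lines, and the sample queries themselves are assumptions made for this example, not behaviour documented on this page. The field names url and title are taken from the -showfields example above; whether such queries match depends on how those fields are indexed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class QueryFileSketch {

    /**
     * Collects one query string per line.
     * Assumption: blank lines and lines starting with '#' are ignored.
     */
    static List<String> readQueries(BufferedReader in) throws IOException {
        List<String> queries = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blanks and comments
            }
            queries.add(line);
        }
        return queries;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical contents of queries.txt, in Lucene QueryParser syntax.
        String sample = "# pages to prune\n"
                + "url:spam.example.com\n"
                + "\n"
                + "title:casino\n";

        List<String> queries = readQueries(new BufferedReader(new StringReader(sample)));
        for (String query : queries) {
            System.out.println(query);
        }
    }
}
```

Running the tool with -dryrun first, as in the earlier example, is a cheap way to see which documents a new query would hit before deleting anything.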

NOTE 2: This tool removes matching documents ONLY from segment indexes (or from 
a merged index). In particular it does NOT remove the pages and links from 
WebDB. This means that unwanted URLs may pop up again when new segments are 
created. To prevent this, use your own net.nutch.net.URLFilter, or 
PruneDBTool (under construction...).

NOTE 3: This tool uses a low-level Lucene interface to collect all matching 
documents. For large indexes and broad queries this may result in high memory 
consumption. If you encounter OutOfMemory exceptions, try to narrow down your 
queries, or increase the heap size.

[CommandLineOptions]
