[Nutch Wiki] Trivial Update of "bin/nutch prune" by RobPettengill

Apache Wiki Sat, 16 Jul 2005 17:46:40 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by RobPettengill:
http://wiki.apache.org/nutch/bin/nutch_prune

------------------------------------------------------------------------------
  
  This tool prunes unwanted content from existing Nutch indexes. The main method 
accepts a list of segment directories (containing indexes). These indexes will 
be pruned of any content that matches one or more queries from a list of Lucene 
queries read from a file (defined in the standard config file, or explicitly 
overridden from the command line). Segments should already be indexed; any 
segments that are missing indexes will be skipped.
  
- NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so a knowledge 
of available Lucene document fields is required. This can be obtained by 
reading sources of index-basic and index-more plugins, or using tools like 
Luke. During query parsing a WhitespaceAnalyzer is used - this choice has been 
made to minimize side effects of Analyzer on the final set of query terms. You 
can use link net.nutch.searcher.Query.main(String[]) method to translate 
queries in Nutch syntax to queries in Lucene syntax.
+ NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so a knowledge 
of available Lucene document fields is required. This can be obtained by 
reading sources of index-basic and index-more plugins, or using tools like 
Luke. During query parsing a !WhitespaceAnalyzer is used - this choice has been 
made to minimize side effects of Analyzer on the final set of query terms. You 
can use link net.nutch.searcher.Query.main(String[]) method to translate 
queries in Nutch syntax to queries in Lucene syntax.
- If additional level of control is required, an instance of {@link 
PruneChecker} can be provided to check each document before it's deleted. The 
results of all checkers are logically AND-ed, which means that any checker in 
the chain can veto the deletion of the current document. Two example checker 
implementations are provided - PrintFieldsChecker prints the values of selected 
index fields, StoreUrlsChecker stores the URLs of deleted documents to a file. 
Any of them can be activated by providing respective command-line options.
+ If additional level of control is required, an instance of !PruneChecker can 
be provided to check each document before it's deleted. The results of all 
checkers are logically AND-ed, which means that any checker in the chain can 
veto the deletion of the current document. Two example checker implementations 
are provided - !PrintFieldsChecker prints the values of selected index fields, 
!StoreUrlsChecker stores the URLs of deleted documents to a file. Any of them 
can be activated by providing respective command-line options.
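
The veto behaviour described above can be illustrated with a small, self-contained Java sketch. It does not use the real Nutch classes: the Checker interface, the UrlRecorder class, the shouldDelete method, and the example URLs are all hypothetical stand-ins invented for illustration, and documents are modelled as plain field-name-to-value maps. The sketch assumes Java 9 or later (for Map.of and List.of) and stops at the first veto, which the page text does not confirm for the actual tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PruneChainSketch {

    /** Hypothetical stand-in for a checker; NOT the real Nutch PruneChecker API. */
    interface Checker {
        /** Returns false to veto deletion of the given document. */
        boolean isPrunable(Map<String, String> doc);
    }

    /** Never vetoes; only records the URL of each document it is asked about. */
    static final class UrlRecorder implements Checker {
        final List<String> urls = new ArrayList<>();

        @Override
        public boolean isPrunable(Map<String, String> doc) {
            urls.add(doc.get("url"));
            return true;
        }
    }

    /** Logical AND over the chain: one false result is enough to keep the document. */
    static boolean shouldDelete(Map<String, String> doc, List<Checker> chain) {
        for (Checker checker : chain) {
            if (!checker.isPrunable(doc)) {
                return false; // vetoed; later checkers are not consulted in this sketch
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical rule: never delete anything under keep.example.
        Checker protectKeepHost = doc -> !doc.get("url").startsWith("http://keep.example/");
        UrlRecorder recorder = new UrlRecorder();
        List<Checker> chain = List.of(protectKeepHost, recorder);

        Map<String, String> kept = Map.of("url", "http://keep.example/a", "title", "A");
        Map<String, String> pruned = Map.of("url", "http://spam.example/b", "title", "B");

        System.out.println(shouldDelete(kept, chain));
        System.out.println(shouldDelete(pruned, chain));
        System.out.println(recorder.urls);
    }
}
```

Placing the recording checker last means it only sees documents that every earlier checker agreed to delete, which is one way to get a log of deleted URLs similar in spirit to what the page describes for !StoreUrlsChecker.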
  
  
  Typical Usage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -dryrun 
-queries queries.txt -showfields url,title[[BR]]
