[Nutch Wiki] Update of "bin/nutch prune" by JerryRussell

Apache Wiki Mon, 09 Jan 2006 14:49:00 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_prune

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- prune is an alias for net.nutch.tools.!PruneIndexTool
+ prune is an alias for org.apache.nutch.tools.!PruneIndexTool
  
  This tool prunes existing Nutch indexes of unwanted content. The main method 
accepts a list of segment directories (containing indexes). These indexes will 
be pruned of any content that matches one or more queries from a list of Lucene 
queries read from a file (defined in the standard config file, or explicitly 
overridden on the command line). Segments should already be indexed; any 
segments that are missing indexes will be skipped.
  
- NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so knowledge 
of the available Lucene document fields is required. This can be obtained by 
reading the sources of the index-basic and index-more plugins, or by using 
tools like Luke. During query parsing a !WhitespaceAnalyzer is used; this 
choice was made to minimize the side effects of the Analyzer on the final set 
of query terms. You can use the net.nutch.searcher.Query.main(String[]) method 
to translate queries in Nutch syntax to queries in Lucene syntax.
+ NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so knowledge 
of the available Lucene document fields is required. This can be obtained by 
reading the sources of the index-basic and index-more plugins, or by using 
tools like Luke. During query parsing a !WhitespaceAnalyzer is used; this 
choice was made to minimize the side effects of the Analyzer on the final set 
of query terms. You can use the org.apache.nutch.searcher.Query.main(String[]) 
method to translate queries in Nutch syntax to queries in Lucene syntax.
  If an additional level of control is required, an instance of !PruneChecker 
can be provided to check each document before it is deleted. The results of all 
checkers are logically AND-ed, which means that any checker in the chain can 
veto the deletion of the current document. Two example checker implementations 
are provided: !PrintFieldsChecker prints the values of selected index fields, 
and !StoreUrlsChecker stores the URLs of deleted documents to a file. Either 
can be activated by providing the respective command-line option.
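
  The AND-ed checker chain described above can be sketched roughly as follows. 
This is only an illustration of the veto logic, not the tool's actual code: the 
Checker interface, the shouldDelete method and the example URLs are all 
hypothetical names chosen for this sketch, and a document is modelled as a 
plain field map rather than a Lucene document.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these types are NOT the real PruneChecker API.
final class PruneCheckerSketch {

    /** Hypothetical checker: returns true to allow the deletion, false to veto it. */
    interface Checker {
        boolean allowsDeletion(Map<String, String> doc);
    }

    /** Logical AND over the chain: a single false result vetoes the deletion. */
    static boolean shouldDelete(List<Checker> chain, Map<String, String> doc) {
        for (Checker checker : chain) {
            if (!checker.allowsDeletion(doc)) {
                return false; // one veto is enough; later checkers are not consulted
            }
        }
        return true; // every checker agreed (also the case for an empty chain)
    }

    public static void main(String[] args) {
        // Assumed example checkers: one always agrees, one protects a URL prefix.
        Checker alwaysAllow = doc -> true;
        Checker protectKeepHost = doc -> !doc.get("url").startsWith("http://keep.example/");
        List<Checker> chain = Arrays.asList(alwaysAllow, protectKeepHost);

        Map<String, String> spam = Collections.singletonMap("url", "http://spam.example/page");
        Map<String, String> kept = Collections.singletonMap("url", "http://keep.example/page");

        System.out.println(shouldDelete(chain, spam));
        System.out.println(shouldDelete(chain, kept));
    }
}
```

  The sketch stops at the first veto. Whether the real tool still runs the 
remaining checkers after a veto is not stated here, so treat that as a design 
choice of the sketch.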
  
  
- Typical Usage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -dryrun 
-queries queries.txt -showfields url,title[[BR]]
+ Typical Usage: bin/nutch org.apache.nutch.tools.!PruneIndexTool index_dir 
-dryrun -queries queries.txt -showfields url,title[[BR]]
  This command will just print out fields of matching documents.
  
- Typical Usage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -queries 
queries.txt[[BR]]
+ Typical Usage: bin/nutch org.apache.nutch.tools.!PruneIndexTool index_dir 
-queries queries.txt[[BR]]
  This command will actually remove all matching entries, according to the 
queries read from the queries.txt file.
  
  NOTE 2: This tool removes matching documents ONLY from segment indexes (or 
from a merged index). In particular, it does NOT remove the pages and links 
from WebDB. This means that unwanted URLs may pop up again when new segments 
are created. To prevent this, use your own net.nutch.net.URLFilter 
implementation, or PruneDBTool (under construction...).
