Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_prune

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- prune is an alias for net.nutch.tools.!PruneIndexTool
+ prune is an alias for org.apache.nutch.tools.!PruneIndexTool
  
  This tool prunes existing Nutch indexes of unwanted content. The main method accepts a list of segment directories (containing indexes). These indexes will be pruned of any content that matches one or more query from a list of Lucene queries read from a file (defined in standard config file, or explicitly overridden from command-line). Segments should already be indexed, if some of them are missing indexes then these segments will be skipped.
  
- NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so a knowledge of available Lucene document fields is required. This can be obtained by reading sources of index-basic and index-more plugins, or using tools like Luke. During query parsing a !WhitespaceAnalyzer is used - this choice has been made to minimize side effects of Analyzer on the final set of query terms. You can use link net.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
+ NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so a knowledge of available Lucene document fields is required. This can be obtained by reading sources of index-basic and index-more plugins, or using tools like Luke. During query parsing a !WhitespaceAnalyzer is used - this choice has been made to minimize side effects of Analyzer on the final set of query terms. You can use link org.apache.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
  
  If additional level of control is required, an instance of !PruneChecker can be provided to check each document before it's deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided - !PrintFieldsChecker prints the values of selected index fields, !StoreUrlsChecker stores the URLs of deleted documents to a file. Any of them can be activated by providing respective command-line options.
  
- Typical Useage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title[[BR}}
+ Typical Useage: bin/nutch org.apache.nutch.tools.!PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title[[BR}}
  This command will just print out fields of matching documents.
  
- Typical Useage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -queries queries.txt[[BR]]
+ Typical Useage: bin/nutch org.apache.nutch.tools.!PruneIndexTool index_dir -queries queries.txt[[BR]]
  This command will actually remove all matching entries, according to the queries read from queries.txt file.
  
  NOTE 2: This tool removes matching documents ONLY from segment indexes (or from a merged index). In particular it does NOT remove the pages and links from WebDB. This means that unwanted URLs may pop up again when new segments are created. To prevent this, use your own link net.nutch.net.URLFilter, or PruneDBTool (under construction...).
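
The veto rule described in the page text (checker results are logically AND-ed, so any one checker can block a deletion) can be sketched roughly as below. This is a self-contained illustration only, not the Nutch source: the `Checker` interface, `shouldDelete`, `PRINT_URL`, `KEEP_DOCS`, `CHAIN`, and the sample URLs are all hypothetical names made up for the example, and the real !PruneChecker type has its own signature that this page does not show.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; every name here is hypothetical, not the Nutch API. */
public class CheckerChainSketch {

    /** Hypothetical stand-in for a prune checker: return false to veto deletion. */
    interface Checker {
        boolean isPrunable(Map<String, String> doc);
    }

    /** Side-effect-only checker (in the spirit of printing fields); never vetoes. */
    static final Checker PRINT_URL = doc -> {
        System.out.println("candidate: " + doc.get("url"));
        return true;
    };

    /** Vetoes deletion of any document whose URL contains "/docs/" (assumed rule). */
    static final Checker KEEP_DOCS = doc -> !doc.get("url").contains("/docs/");

    static final List<Checker> CHAIN = Arrays.asList(PRINT_URL, KEEP_DOCS);

    /** A document is deleted only if every checker in the chain agrees (logical AND). */
    static boolean shouldDelete(Map<String, String> doc, List<Checker> chain) {
        for (Checker c : chain) {
            if (!c.isPrunable(doc)) {
                return false; // a single veto is enough; this sketch stops early
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> spam = Collections.singletonMap("url", "http://example.com/spam/1");
        Map<String, String> docs = Collections.singletonMap("url", "http://example.com/docs/index.html");
        System.out.println("delete spam page: " + shouldDelete(spam, CHAIN));
        System.out.println("delete docs page: " + shouldDelete(docs, CHAIN));
    }
}
```

Stopping at the first veto is a choice made for this sketch; whether the actual tool still runs the remaining checkers after a veto is not stated on the page.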
