Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by RobPettengill:
http://wiki.apache.org/nutch/bin/nutch_prune

------------------------------------------------------------------------------
  This tool prunes existing Nutch indexes of unwanted content. The main method accepts a list of segment directories (containing indexes). These indexes will be pruned of any content that matches one or more query from a list of Lucene queries read from a file (defined in standard config file, or explicitly overridden from command-line). Segments should already be indexed, if some of them are missing indexes then these segments will be skipped.
  
- NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so a knowledge of available Lucene document fields is required. This can be obtained by reading sources of index-basic and index-more plugins, or using tools like Luke. During query parsing a WhitespaceAnalyzer is used - this choice has been made to minimize side effects of Analyzer on the final set of query terms. You can use link net.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
+ NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so a knowledge of available Lucene document fields is required. This can be obtained by reading sources of index-basic and index-more plugins, or using tools like Luke. During query parsing a !WhitespaceAnalyzer is used - this choice has been made to minimize side effects of Analyzer on the final set of query terms. You can use link net.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
  
- If additional level of control is required, an instance of [EMAIL PROTECTED] PruneChecker} can be provided to check each document before it's deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided - PrintFieldsChecker prints the values of selected index fields, StoreUrlsChecker stores the URLs of deleted documents to a file. Any of them can be activated by providing respective command-line options.
+ If additional level of control is required, an instance of !PruneChecker can be provided to check each document before it's deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided - !PrintFieldsChecker prints the values of selected index fields, !StoreUrlsChecker stores the URLs of deleted documents to a file. Any of them can be activated by providing respective command-line options.
  
+ Typical Usage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title[[BR]]
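
The AND-ed checker chain described above can be illustrated with a minimal, self-contained sketch. It does not use the actual Nutch or Lucene classes: the DocChecker interface, the shouldDelete method and the map-based document are hypothetical stand-ins, assumed here only to show how a single checker can veto a deletion.

```java
import java.util.List;
import java.util.Map;

public class CheckerChainSketch {

    /** Hypothetical stand-in for a prune checker (not the real Nutch interface). */
    public interface DocChecker {
        /** Returns true if this checker allows the document to be deleted. */
        boolean isPrunable(Map<String, String> doc);
    }

    /**
     * Logical AND over all checkers: the document is deleted only if every
     * checker agrees. This sketch stops at the first veto, so later checkers
     * in the chain are not consulted for that document.
     */
    public static boolean shouldDelete(Map<String, String> doc, List<DocChecker> checkers) {
        for (DocChecker checker : checkers) {
            if (!checker.isPrunable(doc)) {
                return false; // one veto is enough to keep the document
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A checker that only reports a field and never vetoes.
        DocChecker printUrl = doc -> {
            System.out.println("candidate url: " + doc.get("url"));
            return true;
        };
        // A checker that vetoes deletion of anything under /keep/ (assumed rule).
        DocChecker protectKeepDir = doc -> !doc.get("url").contains("/keep/");

        List<DocChecker> chain = List.of(printUrl, protectKeepDir);

        Map<String, String> spam = Map.of("url", "http://example.com/spam/page.html");
        Map<String, String> kept = Map.of("url", "http://example.com/keep/page.html");

        System.out.println("delete spam? " + shouldDelete(spam, chain));
        System.out.println("delete kept? " + shouldDelete(kept, chain));
    }
}
```

With an empty chain the sketch deletes every matching document, which mirrors running the tool with no extra checkers configured.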
