OK, I have spent a fair amount of time trying to figure out how to create the correct Lucene queries to use with the PruneIndexTool. I have read the wiki page for bin/nutch Prune, looked at the Lucene Query Parser Syntax page, and browsed past mailing list discussions on the subject.
Accordingly, I have used bin/nutch org.apache.nutch.searcher.Query to create queries for a specific URL or a specific directory. I enter the URL or directory at the Query prompt and then copy the +(url:"*") section of the output into my queries.txt file.

However, I am still at a loss as to how to create the proper Lucene queries for GROUPS of files and folders. Here are some of the most common groupings of files and/or directories I am trying to prune from my index. It would be great if anyone could suggest the correct Lucene query to use and/or how to figure out these types of queries.

1. I want to prune the URL "http://www.testsite.com/testdir/", but I don't want to prune any other files in the /testdir/ directory.

2. I want to prune URLs in the range http://www.testsite.com/[20-40]/, meaning the following URLs would be pruned:

   http://www.testsite.com/20/
   http://www.testsite.com/21/
   ...
   http://www.testsite.com/39/
   http://www.testsite.com/40/

   I would even settle for the following URLs being pruned: http://www.testsite.com/??/

3. I want to prune the URLs "http://www.testsite.com/*.php", either just in this directory or recursively through all sub-directories (ideally I would like to know how to do both).

Any help is much appreciated!

-Bryan
