OK, I have spent a fair amount of time trying to figure out how to create the correct Lucene queries to use with the PruneIndexTool. I have read the wiki page for bin/nutch Prune, looked at the Lucene Query Parser Syntax page, and browsed past mailing list discussions on the subject.
Accordingly, I have used bin/nutch org.apache.nutch.searcher.Query to create queries for a specific URL or a specific directory. I enter the URL or directory at the Query prompt and then copy the +(url:"*") section of the output into my queries.txt file.

However, I am still at a loss as to how to create the proper Lucene queries for GROUPS of files and folders. Here are some of the most common groupings of files and/or directories I am trying to prune from my index. It would be great if anyone could suggest the correct Lucene query to use and/or how to figure out these types of queries.

1. I want to prune the URL "http://www.testsite.com/testdir/", but I don't want to prune any other files in the /testdir/ directory.

2. I want to prune URLs in the range http://www.testsite.com/[20-40]/, meaning the following URLs would be pruned:

   http://www.testsite.com/20/
   http://www.testsite.com/21/
   ...
   http://www.testsite.com/39/
   http://www.testsite.com/40/

   I would even settle for the following URLs being pruned: http://www.testsite.com/??/

3. I want to prune the URLs "http://www.testsite.com/*.php", either just in this directory or recursively through all sub-directories (ideally I would like to know how to do both).

Any help is much appreciated!

-Bryan
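
To pin down which URLs each of the three cases is meant to cover, the intended matches can be sketched as plain Python predicates. This is only an illustration of the desired behaviour: it is not Lucene query syntax and not something PruneIndexTool accepts. The function names are hypothetical, and case 2 assumes that only the directory URL itself (not the pages beneath it) should match.

```python
import re
from urllib.parse import urlparse

HOST = "www.testsite.com"  # example host taken from the question


def wants_prune_case1(url):
    """Case 1: only the directory URL itself, nothing beneath /testdir/."""
    return url == "http://www.testsite.com/testdir/"


def wants_prune_case2(url):
    """Case 2: top-level directory URLs numbered 20 through 40.

    Assumption: only the directory URL matches, not files inside it.
    """
    m = re.fullmatch(r"http://www\.testsite\.com/(\d{2})/", url)
    return m is not None and 20 <= int(m.group(1)) <= 40


def wants_prune_case3(url, recursive):
    """Case 3: .php pages, either top-level only or in any sub-directory."""
    parts = urlparse(url)
    if parts.scheme != "http" or parts.netloc != HOST:
        return False
    if not parts.path.endswith(".php"):
        return False
    # A top-level file has exactly one "/" in its path, e.g. "/index.php".
    return recursive or parts.path.count("/") == 1
```

A list of indexed URLs could be run through these predicates to check, before pruning, that a candidate query selects the same set of documents.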
