OK,

I have spent a fair amount of time trying to figure out how to create
the correct Lucene queries to use with the PruneIndexTool. I have read
the wiki page for bin/nutch Prune, looked at the Lucene Query Parser
Syntax page and browsed past mailing list discussions on the subject.

Accordingly, I have used bin/nutch org.apache.nutch.searcher.Query to
create queries for a specific URL or a specific directory. I enter the
URL or directory at the Query prompt and then copy the +(url:"*")
section of the output into my queries.txt file.

However, I am still at a loss for how to create the proper Lucene
queries for GROUPS of files and folders.

Here are some of the most common groupings of files and/or directories
I am trying to prune from my index. It would be great if anyone could
suggest the correct Lucene query to use and/or how to figure out these
types of queries.

1. I want to prune the URL "http://www.testsite.com/testdir/", but I
don't want to prune any other files in the /testdir/ directory.

2. I want to prune URLs in the range: http://www.testsite.com/[20-40]/

(meaning the following URLs would be pruned):

http://www.testsite.com/20/
http://www.testsite.com/21/
...
http://www.testsite.com/39/
http://www.testsite.com/40/

I would even settle for the following URLs being pruned:
http://www.testsite.com/??/

3. I want to prune the URLs "http://www.testsite.com/*.php"

Either just in this directory, or recursively through all
sub-directories (ideally I would like to know how to do both).
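
To pin down exactly which URLs each case should cover, the three
groupings can be written as regular expressions. This is only a rough
Python sketch of the intended matching, not Lucene query syntax and not
anything PruneIndexTool accepts. The function name which_cases and the
sample URLs not listed above are made up for illustration, and case 2
is assumed to cover only the directory URLs themselves.

```python
import re

# Illustration only: these patterns describe WHICH URLs each numbered
# case should match. They are not Lucene queries.
BASE = re.escape("http://www.testsite.com")

CASES = {
    # Case 1: the directory URL itself, nothing beneath it.
    "1": re.compile(BASE + r"/testdir/"),
    # Case 2: the two-digit directories 20 through 40 (assumption:
    # only the directory URL, not the files inside it).
    "2": re.compile(BASE + r"/(?:2\d|3\d|40)/"),
    # Case 3a: .php files directly in the top-level directory.
    "3a": re.compile(BASE + r"/[^/]+\.php"),
    # Case 3b: .php files at any depth, including the top level.
    "3b": re.compile(BASE + r"/(?:[^/]+/)*[^/]+\.php"),
}


def which_cases(url):
    """Return the labels of all cases whose pattern matches the whole URL."""
    return [label for label, pattern in CASES.items() if pattern.fullmatch(url)]


if __name__ == "__main__":
    for url in (
        "http://www.testsite.com/testdir/",
        "http://www.testsite.com/testdir/page.html",
        "http://www.testsite.com/20/",
        "http://www.testsite.com/41/",
        "http://www.testsite.com/index.php",
        "http://www.testsite.com/a/b/x.php",
    ):
        print(url, which_cases(url))
```

A URL that no pattern matches should stay in the index.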

Any help is much appreciated!

-Bryan
