[Nutch-general] Common Lucene Queries for PruneIndexTool -- GROUPS of files or folders

Bryan Woliner Mon, 16 Jan 2006 14:43:02 -0800

OK,

I have spent a fair amount of time trying to figure out how to create
the correct Lucene queries to use with the PruneIndexTool. I have read
the wiki page for bin/nutch Prune, looked at the Lucene Query Parser
Syntax page and browsed past mailing list discussions on the subject.


Accordingly, I have used bin/nutch org.apache.nutch.searcher.Query to
create queries for a specific URL or a specific directory. I enter the
URL or directory at the Query prompt and then copy the +(url:"*")
section of the output into my queries.txt file.

However, I am still at a loss as to how to create the proper Lucene
queries for GROUPS of files and folders.

Here are some of the most common groupings of files and/or directories I
am trying to prune from my index. It would be great if anyone could
suggest the correct Lucene query to use and/or how to figure out these
types of queries.

1. I want to prune the URL "http://www.testsite.com/testdir/", but I
don't want to prune any other files in the /testdir/ directory.

2. I want to prune URLs in the range: http://www.testsite.com/[20-40]/

(meaning the following URLs would be pruned):

http://www.testsite.com/20/
http://www.testsite.com/21/
...
http://www.testsite.com/39/
http://www.testsite.com/40/

I would even settle for the following URLs being pruned:
http://www.testsite.com/??/

3. I want to prune the URLs "http://www.testsite.com/*.php"

Either just in this directory, or recursively through all
sub-directories (ideally I would like to know how to do both).
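
To pin down exactly which URLs each grouping above should match, here is a
minimal sketch in Python. It only describes the intended matches; it is not
Lucene or Nutch query syntax, and the function names are hypothetical ones
made up for illustration.

```python
import re

# Hypothetical helpers: they describe which URLs should be pruned in each
# of the three cases, not how to express that as a Lucene query.
BASE = "http://www.testsite.com"

def is_exact_dir(url):
    # Case 1: only the directory URL itself, nothing beneath it.
    return url == BASE + "/testdir/"

def in_numeric_range(url, low=20, high=40):
    # Case 2: top-level directories named low through high, inclusive.
    m = re.fullmatch(re.escape(BASE) + r"/(\d+)/", url)
    return bool(m) and low <= int(m.group(1)) <= high

def is_php(url, recursive=False):
    # Case 3: .php files directly under the root, or at any depth
    # when recursive is True.
    path = r"/(?:[^/]+/)*[^/]+\.php" if recursive else r"/[^/]+\.php"
    return re.fullmatch(re.escape(BASE) + path, url) is not None
```

For example, in_numeric_range should accept http://www.testsite.com/20/ and
http://www.testsite.com/40/ but reject http://www.testsite.com/41/, and
is_exact_dir should reject anything below /testdir/.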

Any help is much appreciated!

-Bryan


