Bryan Woliner wrote:

1. I want to prune the URL "http://www.testsite.com/testdir/", but I
don't want to prune any other files in the /testdir/ directory.

2. I want to prune URLs in the range: http://www.testsite.com/[20-40]/

I think you are just unlucky, in the sense that the PruneIndexTool was created with a different goal in mind - namely, to remove offensive or unwanted content containing certain query terms. Due to the way URLs are tokenized it is indeed quite difficult to construct queries that match specific groups of URLs.

I would suggest the following:

* use a query "url:http url:https", which is a handy trick to retrieve all URLs (if you use other protocols, then add them here).

* implement a PruneChecker, which checks URLs according to a list of regexps.

This should do it. You can lift some code from the urlfilter-regex plugin, such as the parts that read and check the regexes.
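To make the regexp part more concrete, here is a rough sketch of the matching logic only. RegexUrlMatcher is a hypothetical name, not an existing Nutch class, and the sketch deliberately does not implement the PruneChecker interface itself; you would still need to wrap it in a PruneChecker that pulls the URL out of each matching document. The two patterns in the usage note are assumptions about what the two cases in the question mean.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): decides whether a URL
// matches any regexp from a list of "prune these" patterns.
final class RegexUrlMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compile every regexp once, up front.
    RegexUrlMatcher(List<String> regexes) {
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
    }

    // Returns true if any pattern is found in the URL.
    // Anchor a pattern with ^ and $ when an exact match is wanted.
    boolean matches(String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }
}
```

For the two cases in the question, the patterns could look like this (as Java string literals):

  "^http://www\\.testsite\\.com/testdir/$"
  "^http://www\\.testsite\\.com/(2[0-9]|3[0-9]|40)/"

The first matches only the directory URL itself, not the files below it. The second assumes that "range" means the directories 20 through 40 together with everything under them; end it with $ if only the directory URLs themselves should go.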

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
