Bryan Woliner wrote:
> 1. I want to prune the URL "http://www.testsite.com/testdir/", but I
> don't want to prune any other files in the /testdir/ directory.
> 2. I want to prune URLs in the range: http://www.testsite.com/[20-40]/

I think you are just unlucky, in the sense that the PruneIndexTool was
created with a different goal in mind - namely, to remove offensive or
unwanted content containing certain query terms. Because of the way URLs
are tokenized, it is indeed quite difficult to construct queries that
match specific groups of URLs.
I would suggest the following:
* Use the query "url:http url:https", which is a handy trick to retrieve
all URLs (if you use other protocols, then add them here).
* Implement a PruneChecker that checks each retrieved URL against a list
of regexps, so that only the matching URLs are pruned.
This should do it. You can lift some code from the urlfilter-regex plugin,
such as the parts that read the regexes and check URLs against them.
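
The regexp check itself can be sketched roughly as follows. This is only
an illustration in plain Java, not code from Nutch: the class
UrlRegexMatcher and its method shouldPrune are hypothetical names, and the
two patterns are my reading of the two cases in the question (I assume
case 2 means everything under the directories /20/ through /40/). How the
URL is obtained from each document and passed in from a PruneChecker
depends on your Nutch version, so that part is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): matches URLs against prune regexps. */
class UrlRegexMatcher {
    private final List<Pattern> patterns = new ArrayList<>();

    UrlRegexMatcher(List<String> regexes) {
        for (String regex : regexes) {
            // Pattern.compile throws PatternSyntaxException for a malformed
            // regexp, so bad rules are rejected before any pruning starts.
            patterns.add(Pattern.compile(regex));
        }
    }

    /** Returns true if the whole URL matches at least one of the regexps. */
    boolean shouldPrune(String url) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With the patterns "http://www\\.testsite\\.com/testdir/" and
"http://www\\.testsite\\.com/(2[0-9]|3[0-9]|40)/.*" (written as Java string
literals), shouldPrune accepts the bare /testdir/ URL but not files below
it, and accepts anything under /20/ to /40/. Because matches() requires the
whole URL to match, no ^ or $ anchors are needed.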
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general