add support for subcollections
------------------------------
Key: NUTCH-201
URL: http://issues.apache.org/jira/browse/NUTCH-201
Project: Nutch
Type: New Feature
Versions: 0.8-dev
Reporter: Sami Siren
Assigned to: Sami Siren
Priority: Minor
Fix For: 0.8-dev
A subcollection is a subset of an index. Subcollections are defined
by URL patterns in the form of a whitelist and a blacklist: to be
included in a subcollection, a page's URL must match the whitelist
and must not match the blacklist.
Subcollection definitions are read from a file, subcollections.xml,
in the following format (imagine that you are crawling all the
virtual hosts of apache.org and you want to tag pages matching the
URL pattern "http://lucene.apache.org/" as part of the subcollection
lucene):
<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
  <subcollection>
    <name>lucene</name>
    <id>lucene</id>
    <whitelist>http://lucene.apache.org/</whitelist>
    <blacklist />
  </subcollection>
</subcollections>
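The whitelist/blacklist rule described above can be sketched roughly as
follows. This is only an illustration, not the plugin's actual code: the
class name SubcollectionSketch and the method contains are hypothetical,
and the sketch assumes that a "pattern" is matched by plain substring
containment, which the issue text does not specify.

```java
import java.util.List;

// Hypothetical illustration of the membership rule; not the plugin's real class.
final class SubcollectionSketch {
    private final String name;
    private final List<String> whitelist;
    private final List<String> blacklist;

    SubcollectionSketch(String name, List<String> whitelist, List<String> blacklist) {
        this.name = name;
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }

    String getName() {
        return name;
    }

    // A URL belongs to the subcollection if it matches at least one
    // whitelist pattern and no blacklist pattern.
    // Assumption: "matches" means the URL contains the pattern as a substring.
    boolean contains(String url) {
        boolean whitelisted = false;
        for (String pattern : whitelist) {
            if (url.contains(pattern)) {
                whitelisted = true;
                break;
            }
        }
        if (!whitelisted) {
            return false;
        }
        for (String pattern : blacklist) {
            if (url.contains(pattern)) {
                return false;
            }
        }
        return true;
    }
}
```

With the example definition above (whitelist "http://lucene.apache.org/",
empty blacklist), a page under http://lucene.apache.org/ would be tagged
as part of lucene, while a page on another apache.org virtual host would
not.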
The plugin contains an indexing filter, a query filter, and supporting
classes.