add support for subcollections
------------------------------

         Key: NUTCH-201
         URL: http://issues.apache.org/jira/browse/NUTCH-201
     Project: Nutch
        Type: New Feature
    Versions: 0.8-dev    
    Reporter: Sami Siren
 Assigned to: Sami Siren 
    Priority: Minor
     Fix For: 0.8-dev


A subcollection is a subset of an index. Subcollections are defined
by URL patterns in the form of a whitelist and a blacklist: for a page
to belong to a subcollection, its URL must match the whitelist and
must not match the blacklist.
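
The whitelist/blacklist rule above can be sketched roughly as follows.
This is only an illustration, not the plugin's code: the class name
SubcollectionSketch and its methods are hypothetical, and it assumes
that a pattern "matches" when the URL contains it as a substring, which
the description above does not actually specify.

```java
import java.util.List;

// Hypothetical sketch; not a class from the Nutch plugin.
public class SubcollectionSketch {
    private final String name;
    private final List<String> whitelist;
    private final List<String> blacklist;

    public SubcollectionSketch(String name, List<String> whitelist,
                               List<String> blacklist) {
        this.name = name;
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }

    public String getName() {
        return name;
    }

    // A URL belongs to the subcollection when it matches at least one
    // whitelist pattern and no blacklist pattern.
    public boolean contains(String url) {
        return matchesAny(whitelist, url) && !matchesAny(blacklist, url);
    }

    // Assumption: substring matching; empty patterns are ignored.
    private static boolean matchesAny(List<String> patterns, String url) {
        for (String pattern : patterns) {
            if (!pattern.isEmpty() && url.contains(pattern)) {
                return true;
            }
        }
        return false;
    }
}
```

With a whitelist of "http://lucene.apache.org/" and an empty blacklist,
a page under lucene.apache.org would be accepted and a page under any
other apache.org virtual host would be rejected.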

Subcollection definitions are read from the file subcollections.xml.
The format is as follows (imagine that you are crawling all the
virtual hosts of apache.org and you want to tag pages matching the
URL pattern "http://lucene.apache.org/" as part of the subcollection
lucene):

<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
       <subcollection>
               <name>lucene</name>
               <id>lucene</id>
               <whitelist>http://lucene.apache.org/</whitelist>
               <blacklist />
       </subcollection>
</subcollections>
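
For illustration, a definition file in the format above could be read
with the JDK's standard DOM parser roughly like this. It is a sketch
under assumptions, not the plugin's loader: the class name
SubcollectionConfigSketch and the choice to return each definition as
a plain String array are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch; not a class from the Nutch plugin.
public class SubcollectionConfigSketch {

    // Returns one {name, id, whitelist, blacklist} array per
    // <subcollection> element found in the given XML text.
    public static List<String[]> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("subcollection");
            List<String[]> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                result.add(new String[] {
                    text(e, "name"), text(e, "id"),
                    text(e, "whitelist"), text(e, "blacklist")
                });
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalStateException("cannot parse definitions", ex);
        }
    }

    // Text of the first child element with this tag, or "" when the
    // element is missing or empty (as with <blacklist />).
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0
            ? "" : list.item(0).getTextContent().trim();
    }
}
```

An empty <blacklist /> element comes back as an empty string here, so a
caller would need to treat that as "no blacklist patterns".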

The plugin contains an indexing filter, a query filter and supporting
classes.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
