[ http://issues.apache.org/jira/browse/NUTCH-201?page=all ]

Sami Siren closed NUTCH-201.
----------------------------


> add support for subcollections
> ------------------------------
>
>                 Key: NUTCH-201
>                 URL: http://issues.apache.org/jira/browse/NUTCH-201
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 0.8
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: subcollections-1.patch, subcollections.2.patch
>
>
> A subcollection is a subset of an index. Subcollections are defined
> by URL patterns in the form of a whitelist and a blacklist: to be
> included in a subcollection, a page must match the whitelist and must
> not match the blacklist. Subcollection definitions are read from the
> file subcollections.xml. The format is as follows (imagine that you
> are crawling all the virtual hosts of apache.org and you want to tag
> pages matching the URL pattern "http://lucene.apache.org/" as part of
> the subcollection lucene):
> <?xml version="1.0" encoding="UTF-8"?>
> <subcollections>
>        <subcollection>
>                <name>lucene</name>
>                <id>lucene</id>
>                <whitelist>http://lucene.apache.org/</whitelist>
>                <blacklist />
>        </subcollection>
> </subcollections>
> The plugin contains an indexing filter, a query filter, and supporting classes.
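
The whitelist/blacklist rule in the description can be sketched roughly as
below. This is only an illustration, not the code from the attached patches:
the class name SubcollectionSketch and the method matches are hypothetical,
and the sketch assumes each pattern is a plain substring of the URL, which
the description does not actually specify.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names come from the NUTCH-201 patches.
public class SubcollectionSketch {

    /**
     * Returns true if the URL belongs to the subcollection: it must match
     * at least one whitelist pattern and no blacklist pattern.
     * ASSUMPTION: a pattern "matches" when it occurs as a substring of the URL.
     */
    public static boolean matches(String url,
                                  List<String> whitelist,
                                  List<String> blacklist) {
        // Any blacklist hit excludes the page outright.
        for (String pattern : blacklist) {
            if (url.contains(pattern)) {
                return false;
            }
        }
        // Otherwise the page is included only on a whitelist hit.
        for (String pattern : whitelist) {
            if (url.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mirrors the "lucene" subcollection from the XML example:
        // one whitelist entry and an empty blacklist.
        List<String> whitelist = Arrays.asList("http://lucene.apache.org/");
        List<String> blacklist = Collections.emptyList();

        System.out.println(matches("http://lucene.apache.org/nutch/", whitelist, blacklist));
        System.out.println(matches("http://httpd.apache.org/", whitelist, blacklist));
    }
}
```

With the example definition, a page under http://lucene.apache.org/ would be
tagged as part of lucene, while pages from other apache.org virtual hosts
would not.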

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
