add support for subcollections
------------------------------

         Key: NUTCH-201
         URL: http://issues.apache.org/jira/browse/NUTCH-201
     Project: Nutch
        Type: New Feature
    Versions: 0.8-dev
    Reporter: Sami Siren
 Assigned to: Sami Siren
    Priority: Minor
     Fix For: 0.8-dev

A subcollection is a subset of an index. Subcollections are defined by URL patterns in the form of a whitelist and a blacklist: to be included in a subcollection, a page's URL must match the whitelist and must not match the blacklist.

Subcollection definitions are read from the file subcollections.xml, and the format is as follows (imagine that you are crawling all the virtual hosts of apache.org and you want to tag pages matching the URL pattern "http://lucene.apache.org/" as part of the subcollection lucene):

<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
  <subcollection>
    <name>lucene</name>
    <id>lucene</id>
    <whitelist>http://lucene.apache.org/</whitelist>
    <blacklist />
  </subcollection>
</subcollections>

The plugin contains an indexing filter, a query filter, and supporting classes.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
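
The whitelist/blacklist rule described in this issue can be illustrated with a small standalone sketch. It is not the plugin's code: it is written in Python rather than Java, and it assumes that each pattern is a plain substring of the URL and that multiple patterns in one element are separated by whitespace. The function names (load_subcollections, matching_subcollections) are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

# The example definition from this issue, embedded for a self-contained run.
SUBCOLLECTIONS_XML = """<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
  <subcollection>
    <name>lucene</name>
    <id>lucene</id>
    <whitelist>http://lucene.apache.org/</whitelist>
    <blacklist />
  </subcollection>
</subcollections>
"""


def parse_patterns(element):
    """Return the patterns in an element (assumption: whitespace-separated)."""
    text = element.text if element is not None and element.text else ""
    return text.split()


def load_subcollections(xml_text):
    """Parse subcollection definitions into a list of plain dicts."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    subcollections = []
    for node in root.findall("subcollection"):
        subcollections.append({
            "id": node.findtext("id"),
            "name": node.findtext("name"),
            "whitelist": parse_patterns(node.find("whitelist")),
            "blacklist": parse_patterns(node.find("blacklist")),
        })
    return subcollections


def matching_subcollections(url, subcollections):
    """Return ids of subcollections whose whitelist matches and blacklist does not.

    Assumption: a pattern "matches" when it occurs as a substring of the URL.
    """
    matches = []
    for sub in subcollections:
        whitelisted = any(pattern in url for pattern in sub["whitelist"])
        blacklisted = any(pattern in url for pattern in sub["blacklist"])
        if whitelisted and not blacklisted:
            matches.append(sub["id"])
    return matches


if __name__ == "__main__":
    subs = load_subcollections(SUBCOLLECTIONS_XML)
    for url in ("http://lucene.apache.org/nutch/", "http://httpd.apache.org/"):
        print(url, matching_subcollections(url, subs))
```

With the definition above, a page under http://lucene.apache.org/ is tagged with the lucene subcollection, while pages from other apache.org virtual hosts get no tag. An empty blacklist element excludes nothing.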