[ http://issues.apache.org/jira/browse/NUTCH-201?page=all ]
Sami Siren closed NUTCH-201.
----------------------------

> add support for subcollections
> ------------------------------
>
>                 Key: NUTCH-201
>                 URL: http://issues.apache.org/jira/browse/NUTCH-201
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 0.8
>            Reporter: Sami Siren
>         Assigned To: Sami Siren
>            Priority: Minor
>             Fix For: 0.8
>
>         Attachments: subcollections-1.patch, subcollections.2.patch
>
>
> A subcollection is a subset of an index. Subcollections are defined
> by URL patterns in the form of a whitelist and a blacklist: to be
> included in a subcollection, a page's URL must match the whitelist
> and must not match the blacklist.
> Subcollection definitions are read from the file subcollections.xml.
> The format is as follows (imagine that you are crawling all the
> virtual hosts of apache.org and you want to tag pages matching the
> URL pattern "http://lucene.apache.org/" as part of the subcollection
> lucene):
>
> <?xml version="1.0" encoding="UTF-8"?>
> <subcollections>
>   <subcollection>
>     <name>lucene</name>
>     <id>lucene</id>
>     <whitelist>http://lucene.apache.org/</whitelist>
>     <blacklist />
>   </subcollection>
> </subcollections>
>
> The plugin contains an indexing filter, a query filter, and
> supporting classes.
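
The whitelist/blacklist rule described in the issue can be illustrated with a
minimal Java sketch. This is not the code from the attached patches: the class
name SubcollectionSketch and its methods are hypothetical, and the sketch
assumes that a "URL pattern" matches when it occurs as a substring of the URL,
which the issue text does not specify.

```java
import java.util.List;

/** Hypothetical illustration only; not the plugin's actual class. */
final class SubcollectionSketch {
    private final String name;
    private final List<String> whitelist;
    private final List<String> blacklist;

    SubcollectionSketch(String name, List<String> whitelist, List<String> blacklist) {
        this.name = name;
        this.whitelist = List.copyOf(whitelist);
        this.blacklist = List.copyOf(blacklist);
    }

    String name() {
        return name;
    }

    /** Assumption: a pattern matches when it appears anywhere in the URL. */
    private static boolean matchesAny(List<String> patterns, String url) {
        for (String pattern : patterns) {
            // Skip empty patterns, which would otherwise match every URL.
            if (!pattern.isEmpty() && url.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    /** A URL belongs to the subcollection if it is whitelisted and not blacklisted. */
    boolean contains(String url) {
        return matchesAny(whitelist, url) && !matchesAny(blacklist, url);
    }

    public static void main(String[] args) {
        // Mirrors the "lucene" definition from the issue: one whitelist entry, empty blacklist.
        SubcollectionSketch lucene = new SubcollectionSketch(
                "lucene", List.of("http://lucene.apache.org/"), List.of());
        System.out.println(lucene.contains("http://lucene.apache.org/java/docs/")); // true
        System.out.println(lucene.contains("http://httpd.apache.org/docs/"));       // false
    }
}
```

With the definition from the issue, a page under http://lucene.apache.org/ is
tagged as part of the lucene subcollection, while pages from other apache.org
virtual hosts are not. Adding an entry to the blacklist excludes matching pages
even when they are whitelisted.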
