prashant_nutch wrote:
Is subcollection useful for searching specific URLs?
How do we activate subcollection at indexing and search time?
In conf/subcollection,
if we include our URLs in the whitelist, can we then search only those URLs?
Is this the command for searching a subcollection?
subcollection:<name of subcollection> <word for specific URL>
<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
  <subcollection>
    <name>nutch</name>
    <id>nutch</id>
    <whitelist>
      http://lucene.apache.org/nutch/
      http://wiki.apache.org/nutch/
    </whitelist>
    <blacklist />
  </subcollection>
</subcollections>
Can anybody explain how the whole thing is supposed to work?
Is it useful for searching specific URLs? (We are using Nutch 0.8.1.)
Subcollection is a very useful way to group a set of URLs and then
assign a label to them. You can use it to limit searching to certain URLs.
You should first enable the subcollection plugin in the nutch-site.xml file
(by adding it to the plugin.includes property).
Then you should add your collections to the conf/subcollections.xml file.
After indexing, the documents whose URLs match a collection should have the
subcollection field in the index.
After that, since subcollection also includes a query plugin, you can run
searches like
java subcollection:nutch
to limit the search to the nutch collection. For details, you can consult
the README file in the plugin's directory.
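
For reference, enabling the plugin might look roughly like this in conf/nutch-site.xml. This is only a sketch: plugin.includes is the standard Nutch property for selecting plugins, but the other plugin names in the value below are an assumed example, so start from the value your installation already uses and append subcollection to it.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumed example value: keep your existing plugin list and append "subcollection" -->
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|subcollection</value>
  </property>
</configuration>
```

Since the subcollection field is written at indexing time, pages indexed before the plugin was enabled would need to be reindexed before a subcollection: query can match them.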