prashant_nutch wrote:
Is subcollection useful for searching specific URLs?
How do we activate subcollection at indexing and search time?

If we include our URLs in the whitelist in conf/subcollection, can we then search only on those URLs?
Is this the command for searching on a subcollection?

subcollection:<name of subcollection> <word for specific URL>


<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
        <subcollection>
                <name>nutch</name>
                <id>nutch</id>
                <whitelist>
                        http://lucene.apache.org/nutch/
                        http://wiki.apache.org/nutch/
                </whitelist>
                <blacklist />
        </subcollection>
</subcollections>

Can anybody explain how the overall thing should work?
Is it useful for searching specific URLs? (We are using Nutch 0.8.1.)

Subcollection is a useful way to group a set of URLs and assign a label to them. You can use it to limit searching to certain URLs.

First, enable the subcollection plugin in the nutch-site.xml file.
Then add your collections to the conf/subcollections.xml file.
After indexing, documents whose URLs match a collection should carry the subcollection field in the index. Because subcollection also includes a query plugin, you can then run searches like

     java subcollection:nutch

This limits the search for "java" to the nutch collection. For details, see the README file in the plugin's directory.
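
Enabling the plugin in nutch-site.xml might look roughly like the sketch below. It assumes the plugin id is "subcollection" (matching the plugin's directory name) and that plugins are switched on through the plugin.includes property. The other entries in the value are only illustrative, so copy the list from your own nutch-default.xml and append subcollection to it.

```xml
<property>
  <name>plugin.includes</name>
  <!-- Illustrative list (an assumption): keep your existing entries and add "subcollection" -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|subcollection</value>
</property>
```

The property belongs inside the <configuration> element of nutch-site.xml. Since the subcollection field is written at indexing time, the plugin most likely needs to be enabled before you index, so pages indexed earlier would have to be re-indexed to pick up the label.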




