Hi,
I just set this up and it is working. We are testing Nutch for our
intranet this week.

The subcollection plugin adds a field to the index, not just a search
query value, so the indexes must be created after you enable the plugin.

For testing you can use the following, assuming subcollection
"dbbugzilla" contains a document with the word "windows":

#sh export JAVA_HOME=/usr/local/java
#sh bin/nutch org.apache.nutch.searcher.NutchBean "windows subcollection:dbbugzilla"

This will print the search results to stdout.


Or from the web:
http://localhost:8080/nutch-0.8/opensearch?query=windows%20subcollection:dbbugzilla

This will give you an XML document of the results.
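
If you build that URL from a script, the space in the query has to be percent-encoded. A minimal Python sketch, assuming the same host, port, and webapp path as the URL above (the helper name opensearch_url is made up for illustration):

```python
from urllib.parse import quote

# Base URL taken from the example above; adjust host, port, and webapp name.
BASE = "http://localhost:8080/nutch-0.8/opensearch"

def opensearch_url(terms, subcollection, base=BASE):
    """Build an OpenSearch URL restricted to one subcollection (illustrative helper)."""
    query = f"{terms} subcollection:{subcollection}"
    # Encode the space as %20 and leave ':' readable, as in the URL above.
    return f"{base}?query={quote(query, safe=':')}"

print(opensearch_url("windows", "dbbugzilla"))
```

This only assembles the string; it does not contact the server.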


Our configs
######## nutch-site.xml ##########
<property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic)|query-(basic|site|url)|summary-basic|scoring-opic|subcollection</value>
        <description>
                Regular expression naming plugin directory names to include.
                Any plugin not matching this expression is excluded. In any
                case you need at least include the nutch-extensionpoints
                plugin. By default Nutch includes crawling just HTML and
                plain text via HTTP, and basic indexing and search plugins.
        </description>
</property>
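
A quick way to sanity-check a plugin.includes value is to test plugin directory names against it. This Python sketch assumes the whole directory name has to match the expression, which I have not verified against the plugin loader; the helper is_included is hypothetical and not part of Nutch:

```python
import re

# plugin.includes value from the nutch-site.xml above, on one line.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-(basic)"
    "|query-(basic|site|url)|summary-basic|scoring-opic|subcollection"
)

def is_included(plugin_dir, pattern=PLUGIN_INCLUDES):
    """Return True if the whole directory name matches the expression (assumed full match)."""
    return re.fullmatch(pattern, plugin_dir) is not None

# Print each candidate directory name with its match result.
for name in ("subcollection", "query-site", "index-more", "parse-pdf"):
    print(name, is_included(name))
```

Note that stray whitespace inside the value, such as from a line wrap, would change what the expression matches.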

######### subcollections.xml ###############
<subcollections>
        <subcollection>
                <name>dbbugzilla</name>
                <id>dbbugzilla</id>
                <whitelist>http://www.yournamehere.com/nutch/searchResultsBug.cfm</whitelist>
                <blacklist />
        </subcollection>
        <subcollection>
                <name>dboink</name>
                <id>dboink</id>
                <whitelist>http://www.yournamehere.com/nutch/searchResultsOink.cfm</whitelist>
                <blacklist />
        </subcollection>
</subcollections>
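
To reason about which pages land in which subcollection, here is a rough Python sketch. It assumes the plugin matches whitelist and blacklist entries as plain substrings of the page URL, which is my reading of it and worth checking against the plugin source. The function name and the test URLs are made up for illustration.

```python
# Whitelist/blacklist entries copied from the subcollections.xml above.
SUBCOLLECTIONS = {
    "dbbugzilla": {
        "whitelist": ["http://www.yournamehere.com/nutch/searchResultsBug.cfm"],
        "blacklist": [],
    },
    "dboink": {
        "whitelist": ["http://www.yournamehere.com/nutch/searchResultsOink.cfm"],
        "blacklist": [],
    },
}

def subcollections_for(url, config=SUBCOLLECTIONS):
    """List subcollections whose whitelist matches the URL and whose blacklist does not.

    Assumption: entries are matched as substrings of the URL.
    """
    matches = []
    for name, rules in config.items():
        whitelisted = any(entry in url for entry in rules["whitelist"])
        blacklisted = any(entry in url for entry in rules["blacklist"])
        if whitelisted and not blacklisted:
            matches.append(name)
    return matches

print(subcollections_for("http://www.yournamehere.com/nutch/searchResultsBug.cfm?id=42"))
print(subcollections_for("http://www.yournamehere.com/other/page.html"))
```

Under that assumption, a URL that matches no whitelist entry ends up in no subcollection.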

Mark Jones
Sr. Systems Integration Specialist
[EMAIL PROTECTED]



-----Original Message-----
From: Bud Witney [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 14, 2006 8:53 AM
To: [email protected]
Subject: Subcollection setup and use

Has anyone had success with the subcollection plugin in 0.8? If so, how
did you set it up, and how do you query it?

I tried with the settings below.

<property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|swf|msword|mspowerpoint|rss)|index-(basic|more)|query-(basic|site|url|more)|subcollection|clustering-carrot2|summary-basic|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints
   plugin. By default Nutch includes crawling just HTML and plain text
   via HTTP, and basic indexing and search plugins.
   </description>
</property>

For querying I tried "collection:{collection name} term",
"subcollection:{collection name} term", and "{collection name}: term".

The last one had the best results but did not seem to restrict to only
the collection; it found items outside of the collection.

Do I need to blacklist all the others, or is it a query/setup issue?

-Bud
  

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
