[ http://issues.apache.org/jira/browse/NUTCH-260?page=all ]

Jake Vanderdray updated NUTCH-260:
----------------------------------

    Attachment: nutch_customizations.tar

The attachment is a tarball of the plugin source.

> Three new plugins that parse, index and query meta tags defined in the 
> configuration
> ------------------------------------------------------------------------------------
>
>          Key: NUTCH-260
>          URL: http://issues.apache.org/jira/browse/NUTCH-260
>      Project: Nutch
>         Type: New Feature

>   Components: indexer, searcher
>     Versions: 0.7.2
>  Environment: Built and tested on Linux so far.
>     Reporter: Jake Vanderdray
>     Priority: Minor
>  Attachments: nutch_customizations.tar
>
> These plugins allow you to define meta tags in your nutch-site file that 
> you want to include in parsing, indexing and searching.  The query plugin 
> must replace query-basic.  The format for adding query terms to 
> nutch-site.xml is:
> <property>
>   <name>meta.names</name>
>   <value>keywords,recommended</value>
>   <description>This is a comma separated list of meta tag names that will
>   be parsed, indexed and searched against when parse-meta, index-meta and
>   query-meta are used.</description>
> </property>
> <property>
>   <name>meta.boosts</name>
>   <value>1.0,5.0</value>
>   <description>Comma separated list of boost values when searching using
>   query-meta.  The order of the values should match the order of meta.names.
>   </description>
> </property>
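
The positional pairing between meta.names and meta.boosts described above can be sketched in a few lines of Java. This is only an illustration, not code from the attached tarball: the class name MetaBoosts and the method pair are hypothetical, and rejecting lists of different lengths is an assumption, since the description only says the order "should match".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the attached plugins.
final class MetaBoosts {

    /** Pairs each name in namesCsv with the boost at the same position in boostsCsv. */
    static Map<String, Float> pair(String namesCsv, String boostsCsv) {
        String[] names = namesCsv.split(",");
        String[] boosts = boostsCsv.split(",");
        // Assumption: a length mismatch is treated as a configuration error.
        if (names.length != boosts.length) {
            throw new IllegalArgumentException("meta.names has " + names.length
                    + " entries but meta.boosts has " + boosts.length);
        }
        // LinkedHashMap keeps the order in which the names were configured.
        Map<String, Float> result = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            result.put(names[i].trim(), Float.parseFloat(boosts[i].trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {keywords=1.0, recommended=5.0}
        System.out.println(pair("keywords,recommended", "1.0,5.0"));
    }
}
```

With the example values from the properties above, "recommended" ends up with a boost of 5.0 and "keywords" with 1.0.
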
> Meta tags found are assumed to have either a single value or be a comma 
> separated list of values.  The values found are added to the index as Lucene 
> keywords (e.g. meta name=keywords values="First Thing, Second Thing" would 
> result in two keyword fields named "keywords".  The first would contain 
> "First Thing" and the second would contain "Second Thing").
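
A minimal sketch of the splitting step described above, assuming each comma separated piece is trimmed and empty pieces are skipped (the description does not say how empty pieces are handled). The class MetaValues is hypothetical and deliberately leaves out the Lucene call that would add each value as a keyword field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the attached plugins.
final class MetaValues {

    /** Splits a meta tag's value into the individual values to index. */
    static List<String> split(String content) {
        List<String> values = new ArrayList<>();
        for (String part : content.split(",")) {
            String value = part.trim();
            // Assumption: empty pieces (e.g. from "a,,b") are not indexed.
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // prints [First Thing, Second Thing]
        System.out.println(split("First Thing, Second Thing"));
    }
}
```

Each element of the returned list would become one keyword field with the meta tag's name.
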
> I had to replace the query-basic plugin in order to allow matches in the meta 
> fields to return hits even if there were no matches in any of the default 
> fields.  The query-basic plugin only returns hits when every search term is 
> found in at least one default field.  I needed hits returned if matches were 
> found in at least one field for every term, and/or the entire search phrase 
> appeared in a meta index field.
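
One possible reading of the matching rule above, as a plain Java sketch without any Lucene types: a document is a hit if the whole search phrase equals one of its meta values, or if every term appears in at least one default field. The class MetaMatch, the lower-casing, and the word splitting are all assumptions made for illustration, not the behavior of query-meta itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of the hit rule, not part of the attached plugins.
final class MetaMatch {

    static boolean isHit(String query, Map<String, String> defaultFields,
                         List<String> metaValues) {
        String phrase = query.trim().toLowerCase(Locale.ROOT);
        if (phrase.isEmpty()) {
            return false;
        }
        // Meta values are untokenized keywords, so only the whole phrase can match.
        for (String value : metaValues) {
            if (value.trim().toLowerCase(Locale.ROOT).equals(phrase)) {
                return true;
            }
        }
        // Otherwise every term must appear in at least one default field.
        for (String term : phrase.split("\\s+")) {
            boolean found = false;
            for (String text : defaultFields.values()) {
                String[] words = text.toLowerCase(Locale.ROOT).split("\\W+");
                if (Arrays.asList(words).contains(term)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }
}
```

Under this reading, a query for "Next Big Thing" hits a document whose meta value is "Next Big Thing" even when no default field contains any of the three words.
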
> One known bug is that common terms are not getting stripped out of the 
> fields' values before they get indexed, so "The Next Big Thing" could not be 
> matched because the query engine will strip out "the" from all queries.  I 
> intend to fix this by stripping out common terms from meta fields before 
> indexing them.
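
The fix proposed above could look roughly like the following sketch, which drops common terms from a meta value before it is indexed. The class CommonTermStripper and its short word list are hypothetical; a real fix would presumably reuse whatever list of common terms the query engine itself applies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the attached plugins.
final class CommonTermStripper {

    // Assumption: a tiny stand-in list, not the query engine's actual common terms.
    private static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and"));

    /** Removes common terms from a meta value, keeping the other words in order. */
    static String strip(String value) {
        StringBuilder out = new StringBuilder();
        for (String word : value.trim().split("\\s+")) {
            if (COMMON.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints Next Big Thing
        System.out.println(strip("The Next Big Thing"));
    }
}
```

Indexing "Next Big Thing" instead of "The Next Big Thing" would let the stored value line up with a query from which "the" has already been removed.
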
> Another issue is that searching for "Next Big Thing" would not match meta 
> index values for "Next", "Big" or "Thing".  You can consider that a bug or a 
> feature depending on how you look at it.
> These plugins were written for and only work on the 0.7.2 branch.
> I'm going to attach a tarball of the source of these three plugins after I 
> create the issue.  To use the plugins, you'll need to untar them in your 
> src/plugins directory and add them to the ant build.xml directive (and of 
> course add them in your nutch-site.xml file).  If these end up getting added 
> to the project, I'll write up documentation on the wiki.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
