[ http://issues.apache.org/jira/browse/NUTCH-260?page=all ]
Jake Vanderdray updated NUTCH-260:
----------------------------------
Attachment: nutch_customizations.tar
The attachment is a tarball of the plugin source.
> Three new plugins that parse, index and query meta tags defined in the
> configuration
> ------------------------------------------------------------------------------------
>
> Key: NUTCH-260
> URL: http://issues.apache.org/jira/browse/NUTCH-260
> Project: Nutch
> Type: New Feature
> Components: indexer, searcher
> Versions: 0.7.2
> Environment: Built and tested on Linux so far.
> Reporter: Jake Vanderdray
> Priority: Minor
> Attachments: nutch_customizations.tar
>
> These plugins let you define, in your nutch-site.xml file, the meta tags
> that you want to include in parsing, indexing and searching. The query
> plugin must replace query-basic. The format for adding the settings to
> nutch-site.xml is:
> <property>
> <name>meta.names</name>
> <value>keywords,recommended</value>
> <description>This is a comma separated list of meta tag names that will
> be parsed, indexed and searched against when parse-meta, index-meta and
> query-meta are used.</description>
> </property>
> <property>
> <name>meta.boosts</name>
> <value>1.0,5.0</value>
> <description>Comma separated list of boost values when searching using
> query-meta. The order of the values should match the order of meta.names.
> </description>
> </property>
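
A minimal sketch of how the two comma separated values line up by position, written in Python for illustration rather than taken from the plugin source. The function name pair_names_with_boosts is hypothetical, and raising an error on a length mismatch is an assumption; the issue does not say how the plugins handle that case.

```python
def pair_names_with_boosts(names_value, boosts_value):
    """Pair each meta tag name with the boost at the same position."""
    names = [n.strip() for n in names_value.split(",") if n.strip()]
    boosts = [float(b) for b in boosts_value.split(",") if b.strip()]
    if len(names) != len(boosts):
        # Assumption: a length mismatch is treated as a configuration error.
        raise ValueError("meta.names and meta.boosts must have the same length")
    return dict(zip(names, boosts))


# Values copied from the meta.names and meta.boosts example above.
meta_boosts = pair_names_with_boosts("keywords,recommended", "1.0,5.0")
print(meta_boosts)
```

With the example values, "keywords" is paired with 1.0 and "recommended" with 5.0.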
> Meta tags found are assumed to have either a single value or a comma
> separated list of values. The values found are added to the index as Lucene
> keywords (e.g. <meta name="keywords" content="First Thing, Second Thing">
> would result in two keyword fields named "keywords". The first would contain
> "First Thing" and the second would contain "Second Thing").
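
The splitting described above can be sketched roughly as follows, in Python for illustration only. The function name meta_to_keyword_fields is hypothetical, and trimming the whitespace around each value is an assumption inferred from the "Second Thing" example; the actual plugin may behave differently.

```python
def meta_to_keyword_fields(name, content):
    """Return one (field name, value) pair per comma separated value."""
    # Assumption: surrounding whitespace is trimmed and empty values are skipped.
    return [(name, value.strip()) for value in content.split(",") if value.strip()]


fields = meta_to_keyword_fields("keywords", "First Thing, Second Thing")
for field_name, field_value in fields:
    print(field_name, "=", field_value)
```

Each pair stands for one untokenized keyword field, so the whole value "First Thing" is indexed as a single unit.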
> I had to replace the query-basic plugin in order to allow matches in the meta
> fields to return hits even if there were no matches in any of the default
> fields. The query-basic plugin only returns hits when every search term is
> found in at least one default field. I needed hits returned if matches were
> found in at least one field for every term, and/or the entire search phrase
> appeared in a meta index field.
> One known bug is that common terms are not getting stripped out of the
> fields' values before they get indexed, so "The Next Big Thing" could not be
> matched because the query engine will strip out "the" from all queries. I
> intend to fix this by stripping out common terms from meta fields before
> indexing them.
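
One way the proposed fix could look, as a hedged Python sketch rather than the plugin's code. The COMMON_TERMS set is a hypothetical stand-in for whatever common-terms list the query engine actually uses, and the case-insensitive comparison is an assumption.

```python
# Hypothetical stand-in for the query engine's list of common terms.
COMMON_TERMS = {"a", "an", "and", "of", "the", "to"}


def strip_common_terms(value):
    """Drop common terms from a meta value before it is indexed."""
    # Assumption: terms are compared case-insensitively, original case is kept.
    kept = [word for word in value.split() if word.lower() not in COMMON_TERMS]
    return " ".join(kept)


print(strip_common_terms("The Next Big Thing"))
```

Under these assumptions the indexed value becomes "Next Big Thing", which is also what remains of the query once "the" has been stripped from it.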
> Another issue is that searching for "Next Big Thing" would not match meta
> index values for "Next", "Big" or "Thing". You can consider that a bug or a
> feature depending on how you look at it.
> These plugins were written for and only work on the 0.7.2 branch.
> I'm going to attach a tarball of the source of these three plugins after I
> create the issue. To use the plugins, you'll need to untar them in your
> src/plugins directory and add them to the ant build.xml directive (and of
> course add them in your nutch-site.xml file). If these end up getting added
> to the project, I'll write up documentation on the wiki.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers