Three new plugins that parse, index and query meta tags defined in the
configuration
------------------------------------------------------------------------------------
Key: NUTCH-260
URL: http://issues.apache.org/jira/browse/NUTCH-260
Project: Nutch
Type: New Feature
Components: indexer, searcher
Versions: 0.7.2
Environment: Built and tested on Linux so far.
Reporter: Jake Vanderdray
Priority: Minor
These plugins allow you to define meta tags in your nutch-site file that you
want to include in parsing, indexing and searching. The query plugin must
replace query-basic. The format for adding query terms to nutch-site.xml is:
<property>
  <name>meta.names</name>
  <value>keywords,recommended</value>
  <description>This is a comma separated list of meta tag names that will
  be parsed, indexed and searched against when parse-meta, index-meta and
  query-meta are used.</description>
</property>
<property>
  <name>meta.boosts</name>
  <value>1.0,5.0</value>
  <description>Comma separated list of boost values when searching using
  query-meta. The order of the values should match the order of meta.names.
  </description>
</property>
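Because the two properties are matched up purely by position, it may help to see the pairing spelled out. The following is only a sketch in Python, not code from the plugins; the function name parse_meta_config and the choice to raise an error on a length mismatch are assumptions made for illustration.

```python
def parse_meta_config(names_value, boosts_value):
    """Pair each meta tag name with the boost at the same position.

    Hypothetical helper: the real plugins may handle a length
    mismatch differently (this sketch simply rejects it).
    """
    names = [n.strip() for n in names_value.split(",") if n.strip()]
    boosts = [float(b) for b in boosts_value.split(",") if b.strip()]
    if len(names) != len(boosts):
        raise ValueError("meta.names and meta.boosts must have the same length")
    return dict(zip(names, boosts))


# Values taken from the example properties above.
META_BOOSTS = parse_meta_config("keywords,recommended", "1.0,5.0")
```

With the example values, "keywords" is paired with 1.0 and "recommended" with 5.0.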
Meta tags found are assumed to have either a single value or a comma
separated list of values. The values found are added to the index as Lucene
keywords, e.g. <meta name="keywords" content="First Thing, Second Thing"> would
result in two keyword fields named "keywords": the first would contain "First
Thing" and the second would contain "Second Thing".
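The splitting described here can be sketched roughly as follows. This is illustrative Python rather than the plugin source; the function name meta_keyword_fields is hypothetical, and trimming whitespace around each value is an assumption based on the example.

```python
def meta_keyword_fields(name, content):
    """Split a meta tag's content on commas into one field per value.

    Each value is kept whole (not tokenized into words), mirroring how
    an untokenized keyword field would store it. Empty values are dropped.
    """
    return [(name, value.strip()) for value in content.split(",") if value.strip()]


fields = meta_keyword_fields("keywords", "First Thing, Second Thing")
# fields == [("keywords", "First Thing"), ("keywords", "Second Thing")]
```

A tag with a single value simply yields a single field.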
I had to replace the query-basic plugin in order to allow matches in the meta
fields to return hits even if there were no matches in any of the default
fields. The query-basic plugin only returns hits when every search term is
found in at least one default field. I needed hits returned if matches were
found in at least one field for every term, and/or the entire search phrase
appeared in a meta index field.
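One possible reading of that matching rule is sketched below in Python. It is not the query-meta implementation: the function is_hit, the whitespace tokenization of the default fields, and the case-insensitive comparison are all assumptions made for illustration. Meta values are compared as whole strings, since they are stored as keywords.

```python
def is_hit(query, default_text, meta_values):
    """Return True if the document should match the query.

    Assumed rule: a hit when every query term occurs in the default
    fields (the query-basic behaviour), or when the entire query
    phrase equals one of the stored meta values.
    """
    terms = query.lower().split()
    if not terms:
        return False
    default_tokens = set(default_text.lower().split())
    all_terms_in_default = all(term in default_tokens for term in terms)
    phrase_in_meta = " ".join(terms) in {value.lower() for value in meta_values}
    return all_terms_in_default or phrase_in_meta


# Matches through the meta field only; the body text has neither term.
meta_only = is_hit("second thing", "unrelated page body", ["First Thing", "Second Thing"])
# Matches through the default fields only; there are no meta values.
default_only = is_hit("big thing", "a big shiny thing", [])
```

Under these assumptions both example calls return True, while a query whose phrase is not stored whole in any meta value still needs every term in the default fields.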
One known bug is that common terms are not getting stripped out of the fields'
values before they get indexed, so "The Next Big Thing" could not be matched
because the query engine will strip out "the" from all queries. I intend to
fix this by stripping out common terms from meta fields before indexing them.
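The intended fix could look something like the Python sketch below. It is not part of the attached plugins: the STOP_WORDS set is a made-up stand-in for whatever list of common terms the query engine actually uses, and strip_common_terms is a hypothetical name.

```python
# Hypothetical stop list; the real list of common terms is an assumption here.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of"})


def strip_common_terms(value):
    """Remove common terms from a meta value before it is indexed."""
    kept = [word for word in value.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


indexed_value = strip_common_terms("The Next Big Thing")
# indexed_value == "Next Big Thing"
```

Indexing "Next Big Thing" instead of "The Next Big Thing" would let the stored value line up with a query from which "the" has already been stripped.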
Another issue is that searching for "Next Big Thing" would not match meta index
values for "Next", "Big" or "Thing". You can consider that a bug or a feature
depending on how you look at it.
These plugins were written for and only work on the 0.7.2 branch.
I'm going to attach a tarball of the source of these three plugins after I
create the issue. To use the plugins, you'll need to untar them in your
src/plugins directory and add them to the ant build.xml directive (and of
course add them in your nutch-site.xml file). If these end up getting added to
the project, I'll write up documentation on the wiki.