ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags
and their subsequent indexing.
-------------------------------------------------------------------------------------------------------------
Key: NUTCH-855
URL: https://issues.apache.org/jira/browse/NUTCH-855
Project: Nutch
Issue Type: New Feature
Components: generator, indexer
Affects Versions: 1.1
Reporter: Scott Gonyea
Fix For: 1.2
This plugin is designed to enhance the NUTCH-655 patch by doing two things:
1. Meta tags that are supplied with your Crawl URLs, during injection, will be
propagated throughout the outlinks of those Crawl URLs.
2. When you index your URLs, the meta tags that you specified with your URLs
will be indexed alongside those URLs and can be queried directly, assuming the
rest of your setup is correct.
The flat-file of URLs you are injecting should, per NUTCH-655, be tab-delimited
in the form of:
[www.url.com]\t[key1]=[value1]\t[key2]=[value2]\t...\t[keyN]=[valueN]
For example (a single tab character separates each field):
http://slashdot.org/ corp_owner=Geeknet will_it_blend=indubitably
http://engadget.com/ corp_owner=Weblogs genre=geeksquad_thriller
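To make the seed-line format above concrete, here is a minimal Java sketch of how one such line could be split into a URL and its key=value pairs. The class SeedLine and its parse method are hypothetical names used only for illustration; this is not the injector code from NUTCH-655.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Nutch or NUTCH-655.
final class SeedLine {
    final String url;
    final Map<String, String> meta;

    private SeedLine(String url, Map<String, String> meta) {
        this.url = url;
        this.meta = meta;
    }

    // Splits a tab-delimited seed line: the first field is the URL,
    // every later field is expected to look like key=value.
    static SeedLine parse(String line) {
        String[] fields = line.split("\t");
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // no '=' or an empty key: skip this field
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            meta.put(key, value);
        }
        return new SeedLine(fields[0].trim(), meta);
    }

    public static void main(String[] args) {
        SeedLine s = parse("http://slashdot.org/\tcorp_owner=Geeknet\twill_it_blend=indubitably");
        // prints: http://slashdot.org/ -> {corp_owner=Geeknet, will_it_blend=indubitably}
        System.out.println(s.url + " -> " + s.meta);
    }
}
```

Note that the fields must be separated by real tab characters; a line that uses spaces instead would be treated as a single field, and no tags would be picked up.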
To activate this plugin, you must modify two properties in your nutch-site.xml:
1. plugin.includes
   from: index-(basic|anchor)
   to:   index-(basic|anchor|urlmeta)
2. urlmeta.tags
   Insert a comma-delimited list of metatags. Using the above example:
   <value>corp_owner, will_it_blend, genre</value>
Note that you do not need to supply every tag with every URL. However, each tag
must be listed in urlmeta.tags if you want it to be propagated and later indexed.
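The rule above can be sketched in Java as follows. This is only an illustration of the described behaviour, not the plugin's actual implementation: the class UrlMetaTagFilter, its methods, and the extra key "editor" are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch for illustration only; not the urlmeta plugin's code.
final class UrlMetaTagFilter {

    // Parses a comma-delimited urlmeta.tags value, trimming surrounding whitespace.
    static Set<String> parseTags(String value) {
        Set<String> tags = new LinkedHashSet<>();
        for (String raw : value.split(",")) {
            String tag = raw.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    // Keeps only the metadata entries whose key is a configured tag.
    // A configured tag that the URL does not carry is simply skipped.
    static Map<String, String> selectPropagated(Map<String, String> urlMeta, Set<String> tags) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (String tag : tags) {
            String value = urlMeta.get(tag);
            if (value != null) {
                kept.put(tag, value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> tags = parseTags("corp_owner, will_it_blend, genre");

        Map<String, String> urlMeta = new LinkedHashMap<>();
        urlMeta.put("corp_owner", "Weblogs");
        urlMeta.put("genre", "geeksquad_thriller");
        urlMeta.put("editor", "unlisted"); // hypothetical key, not listed in urlmeta.tags

        // prints: {corp_owner=Weblogs, genre=geeksquad_thriller}
        System.out.println(selectPropagated(urlMeta, tags));
    }
}
```

In this sketch, will_it_blend is configured but absent from the URL, so it is skipped without error, while editor is present on the URL but not configured, so it is dropped.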