Jonathan Cooper-Ellis created NUTCH-1815:
--------------------------------------------
Summary: Metadata Parsed with parse-tika is Duplicated
Key: NUTCH-1815
URL: https://issues.apache.org/jira/browse/NUTCH-1815
Project: Nutch
Issue Type: Bug
Components: indexer, parser
Affects Versions: 1.8
Reporter: Jonathan Cooper-Ellis
Priority: Minor
When Nutch is configured to parse metatags and index metadata from HTML
documents, disabling parse-html (and using parse-tika instead) causes each
metadata field to be indexed twice with identical content.
I only modified plugin.includes (description and keywords metatags are included
in nutch-site.xml by default, so I did not modify those):
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...</description>
</property>
Sample output:
$ bin/nutch indexchecker http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
fetching: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
parsing: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
contentType: text/html
content : Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
title : Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
host : www.bizjournals.com
tstamp : Thu Jul 10 17:34:56 UTC 2014
metatag.description : A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
metatag.description : A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
url : http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-
In this case, metatag.description appears twice. If parse-html is added back to
plugin.includes and the same command is run, metatag.description appears only
once.
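For comparison, a plugin.includes value with parse-html restored might look like
the following. This is only a sketch: it assumes parse-html is added to the
existing parse-(...) group and that nothing else changes. The exact value used
for the comparison run is not recorded in this report.

```xml
<property>
  <name>plugin.includes</name>
  <!-- Assumption: identical to the value above except that "html" is added to the parse-(...) group -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...</description>
</property>
```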