[ https://issues.apache.org/jira/browse/NUTCH-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche reassigned NUTCH-1815:
------------------------------------
Assignee: Julien Nioche
> Metadata Parsed with parse-tika is Duplicated
> ---------------------------------------------
>
> Key: NUTCH-1815
> URL: https://issues.apache.org/jira/browse/NUTCH-1815
> Project: Nutch
> Issue Type: Bug
> Components: indexer, parser
> Affects Versions: 1.8
> Reporter: Jonathan Cooper-Ellis
> Assignee: Julien Nioche
> Priority: Minor
>
> When Nutch is configured to parse metatags and index metadata from HTML
> documents, disabling parse-html (and using parse-tika instead) causes each
> metadata field to be indexed twice with identical content.
> I only modified plugin.includes (description and keywords metatags are
> included in nutch-site.xml by default, so I did not modify those):
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>...</description>
> </property>
> Sample output:
> $ bin/nutch indexchecker http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> fetching: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> parsing: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> contentType: text/html
> content : Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
> title : Commonwealth Fund survey: Obamacare helped 9.5 million Americans get health insurance, thanks to exc
> host : www.bizjournals.com
> tstamp : Thu Jul 10 17:34:56 UTC 2014
> metatag.description : A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
> metatag.description : A new survey by the Commonwealth Fund found that 9.5 million previously uninsured Americans got cove
> url : http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-
> In this case, metatag.description appears twice. If parse-html is added back
> to plugin.includes and the same command is run, metatag.description appears
> only once.
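>
> For comparison, a sketch of the same property with parse-html restored. The
> position of "html" inside the parse-(...) group is an assumption, not taken
> from the configuration actually used; the rest of the value is copied from
> the property shown earlier in this report:
> <property>
>   <name>plugin.includes</name>
>   <!-- assumption: "html" added to the existing parse-(...) alternation -->
>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>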
--
This message was sent by Atlassian JIRA
(v6.2#6252)