[ https://issues.apache.org/jira/browse/NUTCH-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270871#comment-14270871 ]

Julien Nioche commented on NUTCH-1815:
--------------------------------------

Thanks for the patch, [~ronvandervegt]. It is not quite the right approach, 
though, as it would prevent the indexing of multiple values. It would be 
better to do the check before iterating over the values. The same logic could 
also be applied to the method 
addIndexedMetatags(Metadata metadata, String metatag, String value).
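
To illustrate the suggestion, here is a minimal, self-contained sketch of what 
"check before iterating" could look like. It is not the actual patch: a plain 
Map stands in for Nutch's Metadata class, and the class name 
MetatagDedupSketch, the PREFIX constant and the "skip if the key is already 
present" rule are assumptions made for illustration only. The point is that 
the duplicate check happens once per metatag, so a tag with several values 
still keeps all of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the plugin code; a Map replaces Nutch's Metadata.
class MetatagDedupSketch {
    static final String PREFIX = "metatag."; // assumed key prefix, as in the sample output

    // Multi-valued variant: the check is done once, before the loop,
    // so all values of a tag are kept on the first call.
    static void addIndexedMetatags(Map<String, List<String>> metadata,
                                   String metatag, String[] values) {
        if (values.length == 0) {
            return; // nothing to add
        }
        String key = PREFIX + metatag.toLowerCase(Locale.ROOT);
        if (metadata.containsKey(key)) {
            return; // already added by an earlier pass; skip the whole tag
        }
        List<String> stored = new ArrayList<>();
        for (String value : values) {
            stored.add(value);
        }
        metadata.put(key, stored);
    }

    // Single-valued variant: reuses the same logic.
    static void addIndexedMetatags(Map<String, List<String>> metadata,
                                   String metatag, String value) {
        addIndexedMetatags(metadata, metatag, new String[] { value });
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        // First pass adds both values; the second, identical pass is skipped.
        addIndexedMetatags(metadata, "Keywords", new String[] { "health", "insurance" });
        addIndexedMetatags(metadata, "keywords", new String[] { "health", "insurance" });
        System.out.println(metadata);
    }
}
```

Checking inside the loop instead (per value) would drop legitimate repeated or 
additional values, which is the concern with the attached patch.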

> Metadata Parsed with parse-tika is Duplicated
> ---------------------------------------------
>
>                 Key: NUTCH-1815
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1815
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer, parser
>    Affects Versions: 1.8
>            Reporter: Jonathan Cooper-Ellis
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: NUTCH-1815-1.9.patch
>
>
> When Nutch is configured to parse metatags and index metadata from HTML 
> documents, disabling parse-html (and using parse-tika instead) causes each 
> metadata field to be indexed twice with identical content.
> I only modified plugin.includes (description and keywords metatags are 
> included in nutch-site.xml by default, so I did not modify those):
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>...</description>
> </property>
> Sample output:
> $ bin/nutch indexchecker http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> fetching: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> parsing: http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-did-get.html
> contentType: text/html
> content :     Commonwealth Fund survey: Obamacare helped 9.5 million 
> Americans get health insurance, thanks to exc
> title :       Commonwealth Fund survey: Obamacare helped 9.5 million 
> Americans get health insurance, thanks to exc
> host :        www.bizjournals.com
> tstamp :      Thu Jul 10 17:34:56 UTC 2014
> metatag.description : A new survey by the Commonwealth Fund found that 9.5 
> million previously uninsured Americans got cove
> metatag.description : A new survey by the Commonwealth Fund found that 9.5 
> million previously uninsured Americans got cove
> url : 
> http://www.bizjournals.com/bizjournals/washingtonbureau/2014/07/yes-millions-of-uninsured-americans-
> In this case, metatag.description appears twice. If parse-html is added back 
> to plugin.includes and the same command is run, metatag.description will only 
> appear once.



