[ 
https://issues.apache.org/jira/browse/NUTCH-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard Bouchar updated NUTCH-2567:
----------------------------------
    Description: 
Using Nutch with the following configuration, MetaTagsParser writes HTML meta 
tags to the metadata twice:
{code:xml}
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|parse-(tika|metatags)</value>
    </property>
{code}
The problem seems to come from 
[MetaTagsParser.java#L104-L111|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104-L111]
 :

Both the meta tags from the existing ParseResult and from the HTMLMetaTags are 
added to the metadata with a "metatag." prefix. But the ParseResult object 
already contains the HTML meta tags, because they have been added by TikaParser 
here: 
[TikaParser.java#L198-L206|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L198-L206]

 

This bug is concerning because it makes the segments needlessly large, especially 
if we want to store all metatags (by default, only metatag.description and 
metatag.keywords are stored, and thus duplicated).
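
The double write can be illustrated outside Nutch with a minimal sketch. It assumes a plain multi-valued map stands in for the parse metadata; every name in it (MetaTagDuplicationSketch, addPrefixed, addPrefixedOnce, simulate, the sample tag value) is hypothetical and not Nutch API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the two passes described above; not Nutch code.
class MetaTagDuplicationSketch {

  static final String PREFIX = "metatag.";

  // Appends all values under the prefixed, lowercased name, keeping repeats.
  static void addPrefixed(Map<String, List<String>> metadata, String name,
      List<String> values) {
    metadata
        .computeIfAbsent(PREFIX + name.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
        .addAll(values);
  }

  // Same, but skips values already stored under the prefixed name.
  static void addPrefixedOnce(Map<String, List<String>> metadata, String name,
      List<String> values) {
    List<String> target = metadata
        .computeIfAbsent(PREFIX + name.toLowerCase(Locale.ROOT), k -> new ArrayList<>());
    for (String value : values) {
      if (!target.contains(value)) {
        target.add(value);
      }
    }
  }

  static Map<String, List<String>> simulate(boolean skipExisting) {
    // Assumption: an earlier parser already copied the HTML meta tag here.
    Map<String, List<String>> metadata = new LinkedHashMap<>();
    metadata.put("description", new ArrayList<>(Arrays.asList("A test page")));

    // The same tag as extracted from the HTML document itself.
    Map<String, List<String>> htmlMetaTags = new LinkedHashMap<>();
    htmlMetaTags.put("description", Arrays.asList("A test page"));

    // Pass 1: re-add what is already in the metadata, with the prefix.
    // Iterate over a snapshot because the loop adds new keys to the map.
    for (String name : new ArrayList<>(metadata.keySet())) {
      List<String> values = new ArrayList<>(metadata.get(name));
      if (skipExisting) {
        addPrefixedOnce(metadata, name, values);
      } else {
        addPrefixed(metadata, name, values);
      }
    }

    // Pass 2: add the tags extracted from the HTML, with the same prefix.
    for (Map.Entry<String, List<String>> tag : htmlMetaTags.entrySet()) {
      if (skipExisting) {
        addPrefixedOnce(metadata, tag.getKey(), tag.getValue());
      } else {
        addPrefixed(metadata, tag.getKey(), tag.getValue());
      }
    }
    return metadata;
  }

  public static void main(String[] args) {
    // prints [A test page, A test page]
    System.out.println(simulate(false).get("metatag.description"));
    // prints [A test page]
    System.out.println(simulate(true).get("metatag.description"));
  }
}
```

Skipping values that are already present is only one possible remedy; dropping one of the two passes when parse-tika is active might be another.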

  was:
Using nutch with the following configuration, MetaTagsParser writes HTML meta 
tags to the metadata twice:
{code:java}
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|parse-(tika|metatags)</value>
    </property>
{code}
The problem seems to come from 
[MetaTagsParser.java#L104-L111|https://github.com/apache/nutch/blob/master/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104-L111]
 :

Both the meta tags from the existing ParseResult and from the HTMLMetaTags are 
added to the metadata with a "metatag." prefix. But the ParseResult object 
already contains the HTML meta tags, because they have been added by TikaParser 
here: 
[TikaParser.java#L198-L206|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L198-L206]

 

This bug is concerning, because it makes the segments uselessly big, especially 
if we want to store all metatags (by default, only metatag.description and 
metatag.keywords are stored, and thus duplicated).


> parse-metatags writes every meta tag twice
> ------------------------------------------
>
>                 Key: NUTCH-2567
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2567
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> Using Nutch with the following configuration, MetaTagsParser writes HTML 
> meta tags to the metadata twice:
> {code:xml}
>     <property>
>         <name>plugin.includes</name>
>         <value>protocol-http|parse-(tika|metatags)</value>
>     </property>
> {code}
> The problem seems to come from 
> [MetaTagsParser.java#L104-L111|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104-L111]
>  :
> Both the meta tags from the existing ParseResult and from the HTMLMetaTags 
> are added to the metadata with a "metatag." prefix. But the ParseResult 
> object already contains the HTML meta tags, because they have been added by 
> TikaParser here: 
> [TikaParser.java#L198-L206|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L198-L206]
>  
> This bug is concerning because it makes the segments needlessly large, 
> especially if we want to store all metatags (by default, only 
> metatag.description and metatag.keywords are stored, and thus duplicated).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
