[ 
https://issues.apache.org/jira/browse/NUTCH-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard Bouchar updated NUTCH-2567:
----------------------------------
    Description: 
Using Nutch with the following configuration, MetaTagsParser writes HTML meta 
tags to the metadata twice:
{code:xml}
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|parse-(tika|metatags)</value>
    </property>
{code}
The problem seems to come from 
[MetaTagsParser.java#L104-L111|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104-L111]
 :

Both the meta tags from the existing ParseResult and from the HTMLMetaTags are 
added to the metadata with a "metatag." prefix. But the ParseResult object 
already contains the HTML meta tags, because they have been added by TikaParser 
here: 
[TikaParser.java#L198-L206|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L198-L206]
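
To illustrate the mechanism, here is a minimal, self-contained sketch. It does not use the Nutch API: the class {{MetaTagDuplicationSketch}}, the {{MultiMap}} stand-in for a multi-valued metadata store, and the {{dedupe}} flag are all hypothetical names introduced for this example. It only shows how copying the same tags from two sources under the "metatag." prefix yields two values per tag, and how a simple "skip if already present" check would avoid that.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetaTagDuplicationSketch {

    /** Hypothetical stand-in for a multi-valued metadata store (not the Nutch class). */
    static final class MultiMap {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        /** Always appends the value, even if an identical one is already stored. */
        void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Appends the value only if it is not already stored under that name. */
        void addIfAbsent(String name, String value) {
            List<String> existing = values.computeIfAbsent(name, k -> new ArrayList<>());
            if (!existing.contains(value)) {
                existing.add(value);
            }
        }

        List<String> get(String name) {
            return values.getOrDefault(name, Collections.<String>emptyList());
        }
    }

    /** Copies every tag from one source into the output under the "metatag." prefix. */
    static void copyWithPrefix(Map<String, String> source, MultiMap out, boolean dedupe) {
        for (Map.Entry<String, String> e : source.entrySet()) {
            String key = "metatag." + e.getKey();
            if (dedupe) {
                out.addIfAbsent(key, e.getValue());
            } else {
                out.add(key, e.getValue());
            }
        }
    }

    /** Mimics the two copy passes described above: one per source of meta tags. */
    static MultiMap copyBothSources(Map<String, String> fromParseResult,
                                    Map<String, String> fromHtmlMetaTags,
                                    boolean dedupe) {
        MultiMap out = new MultiMap();
        copyWithPrefix(fromParseResult, out, dedupe);
        copyWithPrefix(fromHtmlMetaTags, out, dedupe);
        return out;
    }

    public static void main(String[] args) {
        // Both sources hold the same tag, as when another parser already stored it.
        Map<String, String> tags = Collections.singletonMap("description", "An example page");
        System.out.println(copyBothSources(tags, tags, false).get("metatag.description"));
        System.out.println(copyBothSources(tags, tags, true).get("metatag.description"));
    }
}
```

Whether the right fix is a check like this or dropping one of the two copy passes is left open here; the sketch only demonstrates the effect.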

 
This bug is concerning because it makes the segments needlessly large, 
especially if we want to store all metatags (by default, only 
metatag.description and metatag.keywords are stored, and thus duplicated).

I would also suggest making the output of 
[Metadata::toString|https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/metadata/Metadata.java#L235-L245]
more readable (for instance, by adding a newline before each new metadata 
value). That would have made this bug much easier to spot in the output of 
parsechecker.
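
As a rough idea of what a more readable output could look like, here is a small standalone sketch. It is not a patch against Metadata.java: the class {{ReadableMetadataSketch}} and its {{format}} method are hypothetical, and the metadata is modelled as a plain map from names to lists of values. Each name/value pair gets its own line, so a repeated value shows up as two identical lines.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReadableMetadataSketch {

    /** Renders one "name=value" pair per line; repeated values appear as repeated lines. */
    static String format(Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            for (String value : entry.getValue()) {
                sb.append(entry.getKey()).append('=').append(value).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative data only: one tag stored twice, one stored once.
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("metatag.description", Arrays.asList("An example page", "An example page"));
        metadata.put("metatag.keywords", Arrays.asList("nutch"));
        System.out.print(format(metadata));
    }
}
```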

  was:
Using Nutch with the following configuration, MetaTagsParser writes HTML meta 
tags to the metadata twice:
{code:java}
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|parse-(tika|metatags)</value>
    </property>
{code}
The problem seems to come from 
[MetaTagsParser.java#L104-L111|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104-L111]
 :

Both the meta tags from the existing ParseResult and from the HTMLMetaTags are 
added to the metadata with a "metatag." prefix. But the ParseResult object 
already contains the HTML meta tags, because they have been added by TikaParser 
here: 
[TikaParser.java#L198-L206|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L198-L206]

 
This bug is concerning, because it makes the segments uselessly big, especially 
if we want to store all metatags (by default, only metatag.description and 
metatag.keywords are stored, and thus duplicated).

I would also suggest making 


> parse-metatags writes every meta tags twice
> -------------------------------------------
>
>                 Key: NUTCH-2567
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2567
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
