[ https://issues.apache.org/jira/browse/NUTCH-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gerard Bouchar updated NUTCH-2567:
----------------------------------
    Description:
Using Nutch with the following configuration, MetaTagsParser writes HTML meta tags to the metadata twice:
{code:java}
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(tika|metatags)</value>
</property>
{code}
The problem seems to come from [MetaTagsParser.java#L104-L111|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104-L111]:

Both the meta tags from the existing ParseResult and those from the HTMLMetaTags are added to the metadata with a "metatag." prefix. But the ParseResult object already contains the HTML meta tags, because TikaParser added them here: [TikaParser.java#L198-L206|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L198-L206]

This bug is concerning because it makes the segments needlessly large, especially if we want to store all metatags (by default, only metatag.description and metatag.keywords are stored, and thus duplicated).

I would also suggest making the output of [Metadata::toString|https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/metadata/Metadata.java#L235-L245] more readable (for instance by adding a newline before each new metadata value). That would have made this bug much easier to spot in the output of the parsechecker.
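To make the duplication concrete, here is a minimal, self-contained sketch of the behaviour described above. It does not use the Nutch classes: `MultiValued`, `copyWithPrefix`, `simulate` and the sample tag values are hypothetical stand-ins, and the `skipDuplicates` branch only illustrates one possible way to avoid the double entries, not necessarily how it should be fixed in MetaTagsParser. The `toReadableString` method shows the kind of one-value-per-line output suggested for Metadata::toString.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetaTagDuplicationSketch {

    /** Hypothetical stand-in for a multi-valued metadata container (not the Nutch class). */
    public static final class MultiValued {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        /** Appends a value, keeping whatever is already stored under the name. */
        public void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        /** Appends a value only if this exact name/value pair is not stored yet. */
        public void addIfAbsent(String name, String value) {
            List<String> stored = values.computeIfAbsent(name, k -> new ArrayList<>());
            if (!stored.contains(value)) {
                stored.add(value);
            }
        }

        public List<String> get(String name) {
            List<String> stored = values.get(name);
            return stored == null ? Collections.<String>emptyList() : stored;
        }

        /** Renders one "name=value" pair per line, so repeated values stand out. */
        public String toReadableString() {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, List<String>> entry : values.entrySet()) {
                for (String value : entry.getValue()) {
                    sb.append(entry.getKey()).append('=').append(value).append('\n');
                }
            }
            return sb.toString();
        }
    }

    /** Copies every tag into the target under a "metatag." prefix. */
    public static void copyWithPrefix(Map<String, String> tags, MultiValued target,
                                      boolean skipDuplicates) {
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            String name = "metatag." + tag.getKey();
            if (skipDuplicates) {
                target.addIfAbsent(name, tag.getValue());
            } else {
                target.add(name, tag.getValue());
            }
        }
    }

    /** Simulates two sources that carry the same meta tags (sample values are made up). */
    public static MultiValued simulate(boolean skipDuplicates) {
        // Tags that an earlier parser already placed in the parse result.
        Map<String, String> alreadyInParseResult = new LinkedHashMap<>();
        alreadyInParseResult.put("description", "An example page");
        alreadyInParseResult.put("keywords", "nutch, crawler");
        // The same tags, seen a second time through the HTML meta tag view.
        Map<String, String> fromHtmlMetaTags = new LinkedHashMap<>(alreadyInParseResult);

        MultiValued target = new MultiValued();
        copyWithPrefix(alreadyInParseResult, target, skipDuplicates);
        copyWithPrefix(fromHtmlMetaTags, target, skipDuplicates);
        return target;
    }

    public static void main(String[] args) {
        System.out.print(simulate(false).toReadableString()); // naive copy from both sources
        System.out.println("---");
        System.out.print(simulate(true).toReadableString());  // duplicates skipped
    }
}
```

With the naive copy, each "metatag." name ends up holding the same value twice, which is the effect reported here; skipping identical name/value pairs leaves a single entry per tag.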

  was:
Using Nutch with the following configuration, MetaTagsParser writes HTML meta tags to the metadata twice:
{code:java}
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(tika|metatags)</value>
</property>
{code}
The problem seems to come from [MetaTagsParser.java#L104-L111|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104-L111]:

Both the meta tags from the existing ParseResult and those from the HTMLMetaTags are added to the metadata with a "metatag." prefix. But the ParseResult object already contains the HTML meta tags, because TikaParser added them here: [TikaParser.java#L198-L206|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L198-L206]

This bug is concerning because it makes the segments needlessly large, especially if we want to store all metatags (by default, only metatag.description and metatag.keywords are stored, and thus duplicated).
I would also suggest making


> parse-metatags writes every meta tags twice
> -------------------------------------------
>
>                 Key: NUTCH-2567
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2567
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> Using Nutch with the following configuration, MetaTagsParser writes HTML
> meta tags to the metadata twice:
> {code:java}
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|parse-(tika|metatags)</value>
> </property>
> {code}
> The problem seems to come from
> [MetaTagsParser.java#L104-L111|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-metatags/src/java/org/apache/nutch/parse/metatags/MetaTagsParser.java#L104-L111]:
> Both the meta tags from the existing ParseResult and those from the HTMLMetaTags
> are added to the metadata with a "metatag." prefix. But the ParseResult
> object already contains the HTML meta tags, because TikaParser added them
> here:
> [TikaParser.java#L198-L206|https://github.com/apache/nutch/blob/929fc9c89afb9267e3116cdca874dbcf9e511430/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L198-L206]
>
> This bug is concerning because it makes the segments needlessly large,
> especially if we want to store all metatags (by default, only
> metatag.description and metatag.keywords are stored, and thus duplicated).
> I would also suggest making the output of
> [Metadata::toString|https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/metadata/Metadata.java#L235-L245]
> more readable (for instance by adding a newline before each new metadata
> value). That would have made this bug much easier to spot in the output of
> the parsechecker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)