[ https://issues.apache.org/jira/browse/TIKA-579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich updated TIKA-579:
---------------------------------
Affects Version/s: 1.8 (was: 0.8)
> DcXMLParser: DC metadata text in extracted body
> -----------------------------------------------
>
> Key: TIKA-579
> URL: https://issues.apache.org/jira/browse/TIKA-579
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.8
> Environment: N/A
> Reporter: Scott Severtson
>
> The DcXMLParser correctly extracts Dublin Core metadata text into the
> Metadata object, but the metadata text is included in the extracted "body".
> Sample XML document:
> ---
> <?xml version="1.0" encoding="UTF-8"?>
> <a xmlns:dc="http://purl.org/dc/elements/1.1/">
> <dc:title>This is the title</dc:title>
> <dc:creator>Scott Severtson</dc:creator>
> <dc:subject>This is the subject</dc:subject>
> <b>This is the body text.</b>
> </a>
> ---
> Sample code:
> ---
> URL xmlDocument = ...
> TikaConfig tikaConfig = new TikaConfig();
> ParseUtils.getStringContent(xmlDocument, tikaConfig, "application/xml");
> ---
> Actual output:
> ---
> This is the title
> Scott Severtson
> This is the subject
> This is the body text.
> ---
> Expected output:
> ---
> This is the body text.
> ---
> The output is the same whether ParseUtils is used or DcXMLParser is called
> directly with a ContentHandler. The ContentHandler receives a single text
> node containing the concatenated metadata and body text, so there is no
> opportunity to work around this issue externally. We would expect DcXMLParser
> to remove DC nodes from the body before extracting the body text, to be
> more consistent with other Tika parsers.
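
The expected behaviour described above can be illustrated with a minimal
sketch that uses only the JDK's SAX API. It is not Tika code and not a
proposed patch: the names DcBodyFilterSketch, BodyTextHandler and extractBody
are hypothetical, and the sketch assumes that "DC nodes" means every element
in the http://purl.org/dc/elements/1.1/ namespace, including its children.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is not part of Tika.
class DcBodyFilterSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Collects character data, skipping text nested inside any DC element. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder body = new StringBuilder();
        private int dcDepth = 0; // greater than 0 while inside a dc:* element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Enter (or go deeper into) a DC subtree.
            if (dcDepth > 0 || DC_NS.equals(uri)) {
                dcDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (dcDepth > 0) {
                dcDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only text outside DC elements counts as body text.
            if (dcDepth == 0) {
                body.append(ch, start, length);
            }
        }

        String bodyText() {
            return body.toString().trim();
        }
    }

    /** Parses the XML string and returns its text with DC element text removed. */
    static String extractBody(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so 'uri' carries the DC namespace
            BodyTextHandler handler = new BodyTextHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.bodyText();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<a xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n"
                + "<dc:title>This is the title</dc:title>\n"
                + "<dc:creator>Scott Severtson</dc:creator>\n"
                + "<dc:subject>This is the subject</dc:subject>\n"
                + "<b>This is the body text.</b>\n"
                + "</a>";
        System.out.println(extractBody(xml)); // prints "This is the body text."
    }
}
```

The depth counter, rather than a simple flag, keeps text suppressed when a DC
element contains child elements. Whether DcXMLParser should filter on the
namespace in this way is an assumption of the sketch, not something the
report specifies.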