[ https://issues.apache.org/jira/browse/TIKA-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2057.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.14, 2.0
Fixed. Thank you for opening this.
> Extract PDF DocInfo fields into separate metadata fields
> --------------------------------------------------------
>
> Key: TIKA-2057
> URL: https://issues.apache.org/jira/browse/TIKA-2057
> Project: Tika
> Issue Type: Improvement
> Components: metadata
> Affects Versions: 1.13
> Reporter: John Haynes
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.0, 1.14
>
> Attachments: int_Consumer_Conditions_of_use.pdf
>
>
> Hi,
> I have a PDF in which the title has been set twice -- once as Dublin Core
> metadata:
> {code}
> <dc:title>
>   <rdf:Alt>
>     <rdf:li xml:lang="x-default">
>       Consumer credit cards - conditions of use
>     </rdf:li>
>   </rdf:Alt>
> </dc:title>
> {code}
> and again in the PDF DocInfo section:
> {code}
> /Title(Consumer Credit Card - Conditions of Use)
> {code}
> When I use Tika to transform the PDF into HTML
> {code}
> java -jar tika-app-1.13.jar int_Consumer_Conditions_of_use.pdf
> {code}
> it outputs this metadata:
> {code}
> <meta name="dc:title" content="Consumer credit cards - conditions of use"/>
> {code}
> and this <title> tag:
> {code}
> <title>Consumer credit cards - conditions of use</title>
> {code}
> meaning we no longer have access to the DocInfo title.
> Is there some way you could adapt Tika to copy this PDF DocInfo forward
> during a conversion under a new type of metadata, e.g.
> {code}
> <meta name="docinfo:title" content="Consumer Credit Card - Conditions of Use"/>
> {code}
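
For illustration only, the request above can be sketched in a few lines of Python: read the title from both sources and keep each under its own key instead of letting one overwrite the other. This is not Tika's implementation. The `rdf:Description` wrapper (added so the namespace prefixes resolve), the helper names `xmp_title`, `docinfo_title`, and `collect_titles`, and the `docinfo:title` key (taken from the reporter's suggestion) are all assumptions, and the regular expression handles only a simple literal string without escapes or nested parentheses.

```python
import re
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and Dublin Core.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# XMP fragment from the report. The rdf:Description wrapper is an
# assumption, added only so that the prefixes are declared.
XMP = """<rdf:Description
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>
    <rdf:Alt>
      <rdf:li xml:lang="x-default">
        Consumer credit cards - conditions of use
      </rdf:li>
    </rdf:Alt>
  </dc:title>
</rdf:Description>"""

# DocInfo entry from the report.
DOCINFO = "/Title(Consumer Credit Card - Conditions of Use)"


def xmp_title(xmp):
    """Return the dc:title text with whitespace collapsed, or None."""
    root = ET.fromstring(xmp)
    li = root.find("dc:title/rdf:Alt/rdf:li", NS)
    if li is None or li.text is None:
        return None
    return " ".join(li.text.split())


def docinfo_title(docinfo):
    """Return the /Title literal string, or None (simple strings only)."""
    match = re.search(r"/Title\s*\(([^()]*)\)", docinfo)
    return match.group(1) if match else None


def collect_titles(xmp, docinfo):
    """Keep both titles under separate (hypothetical) metadata keys."""
    return {
        "dc:title": xmp_title(xmp),
        "docinfo:title": docinfo_title(docinfo),
    }


if __name__ == "__main__":
    for key, value in collect_titles(XMP, DOCINFO).items():
        print(f'<meta name="{key}" content="{value}"/>')
```

Keeping the two values under distinct keys means a consumer can still see that the XMP and DocInfo titles disagree, which is the information lost when only one title is emitted.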