[jira] [Commented] (PDFBOX-2201) getKeywords returns null although keywords are present

Tilman Hausherr (JIRA) Sat, 12 Jul 2014 04:35:30 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059752#comment-14059752
 ]


Tilman Hausherr commented on PDFBOX-2201:
-----------------------------------------

It is more complex than that: the XMP metadata contains two schemas, the 
"Dublin Core Schema" and the "Adobe PDF Schema". See 
http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf
The Acrobat viewer apparently merges the two for its dialog box. (When the 
dialog is first opened, the keywords field is empty; after clicking on the 
yellow warning, the keywords appear.)

You can open the file with Notepad++; the schemas are identified by these 
namespaces:
http://ns.adobe.com/pdf/1.3
http://purl.org/dc/elements/1.1

To get the keywords, use code like this:
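{code}
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.jempbox.xmp.XMPMetadata;
import org.apache.jempbox.xmp.XMPSchemaDublinCore;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.w3c.dom.Document;

        PDDocument document = PDDocument.loadNonSeq(new File("Roland_Berger_TAB_Industry_4_0.pdf"), null);
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        PDMetadata meta = catalog.getMetadata();
        if (meta != null)
        {
            // Parse the XMP stream into a DOM tree and wrap it with JempBox
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder documentBuilder = dbf.newDocumentBuilder();
            Document xmpDocument = documentBuilder.parse(meta.createInputStream());
            XMPMetadata metadata = new XMPMetadata(xmpDocument);
            // The keywords live in the Dublin Core schema as "subjects"
            XMPSchemaDublinCore dc = metadata.getDublinCoreSchema();
            if (dc != null)
            {
                System.out.println(dc.getSubjects());
            }
        }
        document.close();
{code}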
(Yes, the keywords are named "subjects"!)
This prints:
Roland Berger Strategy Consultants, Consulting, think act, manufacturing 
industry, engineered products, Europe

Alternatively, see the file ExtractMetadata.java in the source distribution 
(which, to add to the confusion, doesn't use the getSubjects() call).
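
To see what each of the two schemas holds without going through PDFBox, 
here is a minimal JDK-only sketch that reads the "Adobe PDF Schema" keywords 
and the "Dublin Core Schema" subjects from an XMP packet by namespace. The 
class name XmpKeywordsSketch, its method names and the SAMPLE packet are made 
up for illustration and are not taken from the attached file. Note that the 
namespace URIs inside an XMP packet end with a slash.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, for illustration only.
class XmpKeywordsSketch
{
    static final String NS_PDF = "http://ns.adobe.com/pdf/1.3/";
    static final String NS_DC = "http://purl.org/dc/elements/1.1/";
    static final String NS_RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Invented sample packet; a real one comes from PDMetadata.createInputStream().
    static final String SAMPLE =
          "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
        + "<rdf:RDF xmlns:rdf='" + NS_RDF + "'>"
        + "<rdf:Description rdf:about='' xmlns:pdf='" + NS_PDF + "'>"
        + "<pdf:Keywords>alpha; beta</pdf:Keywords>"
        + "</rdf:Description>"
        + "<rdf:Description rdf:about='' xmlns:dc='" + NS_DC + "'>"
        + "<dc:subject><rdf:Bag><rdf:li>alpha</rdf:li><rdf:li>beta</rdf:li></rdf:Bag></dc:subject>"
        + "</rdf:Description>"
        + "</rdf:RDF></x:xmpmeta>";

    static Document parse(String xmp) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for the *NS lookups below
        DocumentBuilder builder = dbf.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));
    }

    // Adobe PDF schema: pdf:Keywords, written either as an element or as an
    // attribute of rdf:Description. Returns null if neither form is present.
    static String pdfKeywords(Document doc)
    {
        NodeList elements = doc.getElementsByTagNameNS(NS_PDF, "Keywords");
        if (elements.getLength() > 0)
        {
            return elements.item(0).getTextContent().trim();
        }
        NodeList descriptions = doc.getElementsByTagNameNS(NS_RDF, "Description");
        for (int i = 0; i < descriptions.getLength(); i++)
        {
            Element description = (Element) descriptions.item(i);
            if (description.hasAttributeNS(NS_PDF, "Keywords"))
            {
                return description.getAttributeNS(NS_PDF, "Keywords").trim();
            }
        }
        return null;
    }

    // Dublin Core schema: every rdf:li below dc:subject.
    static List<String> dcSubjects(Document doc)
    {
        List<String> result = new ArrayList<String>();
        NodeList subjects = doc.getElementsByTagNameNS(NS_DC, "subject");
        for (int i = 0; i < subjects.getLength(); i++)
        {
            NodeList items = ((Element) subjects.item(i)).getElementsByTagNameNS(NS_RDF, "li");
            for (int j = 0; j < items.getLength(); j++)
            {
                result.add(items.item(j).getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception
    {
        Document doc = parse(SAMPLE);
        System.out.println("pdf:Keywords = " + pdfKeywords(doc));
        System.out.println("dc:subject   = " + dcSubjects(doc));
    }
}
```

Running this against the real XMP stream of a document shows which of the two 
schemas actually carries the keywords, which is the information the Acrobat 
dialog appears to merge.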

> getKeywords returns null although keywords are present
> ------------------------------------------------------
>
>                 Key: PDFBOX-2201
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2201
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.5
>         Environment: Win64
>            Reporter: Walter Kehl
>            Priority: Minor
>         Attachments: Roland_Berger_TAB_Industry_4_0.pdf
>
>
> When accessing a PDF document which clearly has keywords in its metadata, 
> the function call 
> PDDocumentInformation documentInfo = document.getDocumentInformation();
> String info = documentInfo.getKeywords();
> returns null. 
> Sample PDF is attached. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
