[jira] [Comment Edited] (PDFBOX-2201) getKeywords returns null although keywords are present

Tilman Hausherr (JIRA) Sat, 12 Jul 2014 04:51:24 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059752#comment-14059752 ]


Tilman Hausherr edited comment on PDFBOX-2201 at 7/12/14 11:49 AM:
-------------------------------------------------------------------

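It is more complex: there is the /info segment (the document information 
dictionary) and then there are two XMP schemas. 

The info segment, which is what getDocumentInformation() gives you, looks like 
this in the (uncompressed) file. Note that it has no /Keywords entry, which is 
why getKeywords() returns null: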

<<
  /Author (Roland Berger Strategy Consultants)
  /CreationDate (D:20140425173834+02'00')
  /Creator (Adobe InDesign CS6 \(Macintosh\))
  /ModDate (D:20140428101542+02'00')
  /Producer (Adobe PDF Library 10.0.1)
  /Subject (New industrial revolution in Europe; increasing share in the manufacturing industry)
  /Title (THINK ACT Industry 4.0  The new industrial revolution \205 How Europe will succeed)
  /Trapped /False
>>

And there are the two XMP schemas, the "Dublin Core Schema" and the "Adobe PDF 
Schema", which are described in:
http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf

The Acrobat viewer apparently merges all of this for its dialog box. (When the 
dialog is first opened, the keywords field is empty; after clicking on the 
yellow warning, the keywords appear.)
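
To make the "merging" idea concrete, here is a minimal sketch of a fallback lookup across the three places where keywords can live. The class name, the method name and the lookup order (info segment first, then the Adobe PDF schema, then the Dublin Core subjects) are assumptions made for illustration; this is not how Acrobat or PDFBox is known to behave. The inputs are plain values, so the sketch needs no PDF library.

```java
import java.util.Arrays;
import java.util.List;

class KeywordFallbackSketch
{
    /**
     * Returns the first non-empty keyword source, or null if all are empty.
     * The order of the three sources is an assumption, not documented behaviour.
     */
    static String resolveKeywords(String infoKeywords,
                                  String pdfSchemaKeywords,
                                  List<String> dcSubjects)
    {
        // 1. /Keywords from the info segment
        if (infoKeywords != null && !infoKeywords.trim().isEmpty())
        {
            return infoKeywords.trim();
        }
        // 2. keywords property of the Adobe PDF schema
        if (pdfSchemaKeywords != null && !pdfSchemaKeywords.trim().isEmpty())
        {
            return pdfSchemaKeywords.trim();
        }
        // 3. "subjects" of the Dublin Core schema, joined into one string
        if (dcSubjects != null && !dcSubjects.isEmpty())
        {
            return String.join(", ", dcSubjects);
        }
        return null;
    }

    public static void main(String[] args)
    {
        // No keywords in the info segment or the PDF schema: fall back to subjects.
        System.out.println(resolveKeywords(null, null,
                Arrays.asList("Consulting", "Europe")));
    }
}
```

With PDFBox, the first argument would be what documentInfo.getKeywords() returns, and the last one the Dublin Core subjects read from the XMP metadata.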

If you open the compressed file with Notepad++, you won't see the /info 
segment, but you will find the schemas. They are named 
http://ns.adobe.com/pdf/1.3
http://purl.org/dc/elements/1.1
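
To show how the two schemas look inside the XMP packet, here is a small sketch that uses only the JDK's XML parser. The XMP snippet is hand-written for illustration and is not copied from the attached PDF; the class and method names are hypothetical as well. Note that the namespace URIs, as they are declared in XMP, end with a trailing slash.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XmpKeywordSketch
{
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String PDF_NS = "http://ns.adobe.com/pdf/1.3/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Hypothetical XMP packet, written by hand for this example.
    static final String SAMPLE_XMP =
        "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
        + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
        + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\""
        + " xmlns:pdf=\"" + PDF_NS + "\">"
        + "<pdf:Keywords>Consulting; Europe</pdf:Keywords>"
        + "<dc:subject><rdf:Bag>"
        + "<rdf:li>Consulting</rdf:li><rdf:li>Europe</rdf:li>"
        + "</rdf:Bag></dc:subject>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>";

    // Parses the XMP text with namespace support switched on.
    static Document parse(String xmp) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();
        return builder.parse(
                new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));
    }

    // Dublin Core schema: collects the rdf:li items below dc:subject.
    static List<String> dublinCoreSubjects(Document doc)
    {
        List<String> result = new ArrayList<String>();
        NodeList subjects = doc.getElementsByTagNameNS(DC_NS, "subject");
        for (int i = 0; i < subjects.getLength(); i++)
        {
            Element subject = (Element) subjects.item(i);
            NodeList items = subject.getElementsByTagNameNS(RDF_NS, "li");
            for (int j = 0; j < items.getLength(); j++)
            {
                result.add(items.item(j).getTextContent().trim());
            }
        }
        return result;
    }

    // Adobe PDF schema: a single pdf:Keywords string, or null if absent.
    static String pdfKeywords(Document doc)
    {
        NodeList nodes = doc.getElementsByTagNameNS(PDF_NS, "Keywords");
        return nodes.getLength() == 0
                ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception
    {
        Document doc = parse(SAMPLE_XMP);
        System.out.println(dublinCoreSubjects(doc));
        System.out.println(pdfKeywords(doc));
    }
}
```

The Dublin Core schema holds the keywords as a list of separate items, while the Adobe PDF schema holds them as one string, which is one reason the two can differ.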

To get the keywords, use code like this:
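{code}
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.jempbox.xmp.XMPMetadata;
import org.apache.jempbox.xmp.XMPSchemaDublinCore;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.w3c.dom.Document;

        PDDocument document = PDDocument.loadNonSeq(
                new File("Roland_Berger_TAB_Industry_4_0.pdf"), null);
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        // XMP metadata stream of the catalog (not the /info segment)
        PDMetadata meta = catalog.getMetadata();
        if (meta != null)
        {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder documentBuilder = dbf.newDocumentBuilder();
            Document xmpDocument = documentBuilder.parse(meta.createInputStream());
            XMPMetadata metadata = new XMPMetadata(xmpDocument);
            // the keywords are stored in the Dublin Core schema as "subjects"
            XMPSchemaDublinCore dc = metadata.getDublinCoreSchema();
            if (dc != null)
            {
                System.out.println(dc.getSubjects());
            }
        }
        document.close();
{code}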
(Yes, in the Dublin Core schema the keywords are named "subjects"!)
With this you get:
Roland Berger Strategy Consultants, Consulting, think act, manufacturing 
industry, engineered products, Europe

Alternatively, get the file ExtractMetadata.java from the source distribution 
(which, to add to the confusion, doesn't use the getSubjects() call).


was (Author: tilman):
It is more complex - there are two schemas. 
http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf
The "Dublin Core Schema" and the "Adobe PDF Schema". Acrobat viewer apparently 
merges for its dialogbox. (When first opening, the keywords is empty; after 
clicking on the yellow warning, it appears)

You can open the file with NOTEPAD++, the schemas are named 
http://ns.adobe.com/pdf/1.3
http://purl.org/dc/elements/1.1

To get the keywords, use code like this:
{code}
        PDDocument document = PDDocument.loadNonSeq(
                new File("Roland_Berger_TAB_Industry_4_0.pdf"), null);
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        PDMetadata meta = catalog.getMetadata();
        if (meta != null)
        {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder documentBuilder = dbf.newDocumentBuilder();
            Document xmpDocument = documentBuilder.parse(meta.createInputStream());
            XMPMetadata metadata = new XMPMetadata(xmpDocument);
            XMPSchemaDublinCore dc = metadata.getDublinCoreSchema();
            if (dc != null)
                System.out.println(dc.getSubjects());
        }
{code}
(yes, the keywords are named "subjects"!)
and you get
Roland Berger Strategy Consultants, Consulting, think act, manufacturing 
industry, engineered products, Europe

or get the file ExtractMetadata.java from the source distribution. (Which, to 
add to the confusion, doesn't use the getSubjects() call).

> getKeywords returns null although keywords are present
> ------------------------------------------------------
>
>                 Key: PDFBOX-2201
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2201
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.5
>         Environment: Win64
>            Reporter: Walter Kehl
>            Priority: Minor
>         Attachments: Roland_Berger_TAB_Industry_4_0.pdf
>
>
> When accessing a PDF document which clearly has keywords in its metadata, 
> the function call 
> PDDocumentInformation documentInfo = document.getDocumentInformation();
> String info = documentInfo.getKeywords();
> returns null. 
> A sample PDF is attached. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
