Re: [jira] [Commented] (PDFBOX-2201) getKeywords returns null although keywords are present

Leonard Rosenthol Sat, 12 Jul 2014 19:53:21 -0700

There is nothing magic about how Acrobat/Reader goes from XMP to DocInfo
(and vice versa).  It is documented in our own specs (the XMP specs, as you
point to) as well as being standardized in the PDF/A and PDF/X standards
from ISO.


Leonard

On 7/12/14, 7:35 AM, "Tilman Hausherr (JIRA)" <[email protected]> wrote:

>
>    [ https://issues.apache.org/jira/browse/PDFBOX-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059752#comment-14059752 ]
>
>Tilman Hausherr commented on PDFBOX-2201:
>-----------------------------------------
>
>It is more complex: there are two schemas, the "Dublin Core Schema" and
>the "Adobe PDF Schema", both described in
>http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf
>The Acrobat viewer apparently merges them for its dialog box. (When first
>opening it, the keywords field is empty; after clicking on the yellow
>warning, the keywords appear.)
>
>You can open the file with Notepad++; the schema namespaces are
>http://ns.adobe.com/pdf/1.3/
>http://purl.org/dc/elements/1.1/
>
>To get the keywords, use code like this:
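>{code}
>        // Needs java.io.File, javax.xml.parsers.*, org.w3c.dom.Document,
>        // org.apache.pdfbox.pdmodel.*, org.apache.pdfbox.pdmodel.common.PDMetadata
>        // and org.apache.jempbox.xmp.* (PDFBox 1.8.x with JempBox).
>        PDDocument document = PDDocument.loadNonSeq(new File("Roland_Berger_TAB_Industry_4_0.pdf"), null);
>        PDDocumentCatalog catalog = document.getDocumentCatalog();
>        // The XMP packet is stored in the catalog's metadata stream.
>        PDMetadata meta = catalog.getMetadata();
>        if (meta != null)
>        {
>            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
>            DocumentBuilder documentBuilder = dbf.newDocumentBuilder();
>            Document xmpDocument = documentBuilder.parse(meta.createInputStream());
>            XMPMetadata metadata = new XMPMetadata(xmpDocument);
>            // The keywords live in the Dublin Core schema as dc:subject.
>            XMPSchemaDublinCore dc = metadata.getDublinCoreSchema();
>            if (dc != null)
>            {
>                System.out.println(dc.getSubjects());
>            }
>        }
>        document.close();
>{code}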
>(yes, the keywords are named "subjects"!)
>and you get
>Roland Berger Strategy Consultants, Consulting, think act, manufacturing
>industry, engineered products, Europe
>
>or get the file ExtractMetadata.java from the source distribution.
>(Which, to add to the confusion, doesn't use the getSubjects() call).
>
>> getKeywords returns null although keywords are present
>> ------------------------------------------------------
>>
>>                 Key: PDFBOX-2201
>>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2201
>>             Project: PDFBox
>>          Issue Type: Bug
>>          Components: PDModel
>>    Affects Versions: 1.8.5
>>         Environment: Win64
>>            Reporter: Walter Kehl
>>            Priority: Minor
>>         Attachments: Roland_Berger_TAB_Industry_4_0.pdf
>>
>>
>> When accessing a PDF document which clearly has keywords in its
>> metadata, the function call
>> PDDocumentInformation documentInfo = document.getDocumentInformation();
>> String info = documentInfo.getKeywords();
>> returns null. 
>> Sample PDF is attached.
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.2#6252)
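
To make the two-schema point above concrete, here is a minimal sketch that
reads both pdf:Keywords (Adobe PDF schema) and dc:subject (Dublin Core schema)
from an XMP packet. It deliberately uses only the JDK's XML classes rather than
PDFBox or JempBox, so it runs on its own. The class name XmpKeywords, its
method names, and the SAMPLE packet are made up for illustration and are not
taken from the attached file. It assumes the namespace URIs with their trailing
slashes, as the XMP specification defines them.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of PDFBox or JempBox.
final class XmpKeywords {
    public static final String PDF_NS = "http://ns.adobe.com/pdf/1.3/";
    public static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    public static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Made-up sample packet carrying the keywords in both schemas.
    public static final String SAMPLE =
          "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
        + "<rdf:RDF xmlns:rdf='" + RDF_NS + "'>"
        + "<rdf:Description rdf:about='' xmlns:pdf='" + PDF_NS + "' xmlns:dc='" + DC_NS + "'>"
        + "<pdf:Keywords>Consulting, Europe</pdf:Keywords>"
        + "<dc:subject><rdf:Bag>"
        + "<rdf:li>Consulting</rdf:li><rdf:li>Europe</rdf:li>"
        + "</rdf:Bag></dc:subject>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>";

    // Parse an XMP packet; namespace awareness is required for the lookups below.
    public static Document parse(String xmp) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder()
                  .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));
    }

    // Adobe PDF schema: a single string, stored as an element or as an
    // attribute on rdf:Description. Returns null when absent.
    public static String pdfKeywords(Document doc) {
        NodeList nodes = doc.getElementsByTagNameNS(PDF_NS, "Keywords");
        if (nodes.getLength() > 0) {
            return nodes.item(0).getTextContent().trim();
        }
        NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            Element description = (Element) descriptions.item(i);
            if (description.hasAttributeNS(PDF_NS, "Keywords")) {
                return description.getAttributeNS(PDF_NS, "Keywords");
            }
        }
        return null;
    }

    // Dublin Core schema: dc:subject holds an rdf:Bag of rdf:li items.
    public static List<String> dcSubjects(Document doc) {
        List<String> subjects = new ArrayList<>();
        NodeList subjectNodes = doc.getElementsByTagNameNS(DC_NS, "subject");
        for (int i = 0; i < subjectNodes.getLength(); i++) {
            NodeList items = ((Element) subjectNodes.item(i)).getElementsByTagNameNS(RDF_NS, "li");
            for (int j = 0; j < items.getLength(); j++) {
                subjects.add(items.item(j).getTextContent().trim());
            }
        }
        return subjects;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println("pdf:Keywords = " + pdfKeywords(doc));
        System.out.println("dc:subject   = " + dcSubjects(doc));
    }
}
```

With a real file, the input would be the stream returned by
meta.createInputStream() in the quoted snippet rather than the SAMPLE string.
A caller that wants something like what the Acrobat dialog shows could fall
back from the DocInfo keywords to pdf:Keywords and then to the dc:subject
list, though the exact merge order Acrobat uses is not stated in this thread.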
