There is nothing magic about how Acrobat/Reader goes from XMP to DocInfo (and vice versa). It is documented in our own specs (the XMP specs, as you point to) and is also standardized in the PDF/A and PDF/X standards from ISO.
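
For illustration, the DocInfo-to-XMP correspondence as laid out in PDF/A-1 (ISO 19005-1) can be sketched as a plain lookup table. This is only a sketch: the class and method names below are hypothetical and not part of PDFBox or any Adobe API, and the "DocInfo first, then XMP" fallback is an assumed policy for the example, not a description of what Acrobat does internally.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not PDFBox API): DocInfo keys and the XMP
// properties that PDF/A-1 treats as their equivalents.
final class DocInfoXmpMapping {

    private static final Map<String, String> DOCINFO_TO_XMP;

    static {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Title", "dc:title");
        m.put("Author", "dc:creator");
        m.put("Subject", "dc:description");
        m.put("Keywords", "pdf:Keywords");
        m.put("Creator", "xmp:CreatorTool");
        m.put("Producer", "pdf:Producer");
        m.put("CreationDate", "xmp:CreateDate");
        m.put("ModDate", "xmp:ModifyDate");
        DOCINFO_TO_XMP = Collections.unmodifiableMap(m);
    }

    private DocInfoXmpMapping() {
    }

    // Returns the XMP property name for a DocInfo key, or null if the key is not in the table.
    static String xmpPropertyFor(String docInfoKey) {
        return DOCINFO_TO_XMP.get(docInfoKey);
    }

    // Assumed policy for this sketch: use the DocInfo value when it is
    // present and non-empty, otherwise fall back to the XMP value.
    static String effectiveValue(String docInfoValue, String xmpValue) {
        if (docInfoValue != null && !docInfoValue.isEmpty()) {
            return docInfoValue;
        }
        return xmpValue;
    }
}
```

With a table like this, a caller that gets null from the DocInfo dictionary knows which XMP property to consult next, for example `DocInfoXmpMapping.xmpPropertyFor("Keywords")` yields `pdf:Keywords`.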

Leonard

On 7/12/14, 7:35 AM, "Tilman Hausherr (JIRA)" <[email protected]> wrote:

>
>    [ https://issues.apache.org/jira/browse/PDFBOX-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059752#comment-14059752 ]
>
>Tilman Hausherr commented on PDFBOX-2201:
>-----------------------------------------
>
>It is more complex - there are two schemas.
>http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf
>The "Dublin Core Schema" and the "Adobe PDF Schema". Acrobat viewer
>apparently merges for its dialogbox. (When first opening, the keywords is
>empty; after clicking on the yellow warning, it appears)
>
>You can open the file with NOTEPAD++, the schemas are named
>http://ns.adobe.com/pdf/1.3
>http://purl.org/dc/elements/1.1
>
>To get the keywords, use code like this:
>{code}
>    PDDocument document = PDDocument.loadNonSeq(new File("Roland_Berger_TAB_Industry_4_0.pdf"), null);
>    PDDocumentCatalog catalog = document.getDocumentCatalog();
>    PDMetadata meta = catalog.getMetadata();
>    if (meta != null)
>    {
>        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
>        DocumentBuilder documentBuilder = dbf.newDocumentBuilder();
>        Document xmpDocument = documentBuilder.parse(meta.createInputStream());
>        XMPMetadata metadata = new XMPMetadata(xmpDocument);
>        XMPSchemaDublinCore dc = metadata.getDublinCoreSchema();
>        if (dc != null)
>            System.out.println(dc.getSubjects());
>    }
>{code}
>(yes, the keywords are named "subjects"!)
>and you get
>Roland Berger Strategy Consultants, Consulting, think act, manufacturing
>industry, engineered products, Europe
>
>or get the file ExtractMetadata.java from the source distribution.
>(Which, to add to the confusion, doesn't use the getSubjects() call).
>
>> getKeywords returns null although keywords are present
>> ------------------------------------------------------
>>
>>                 Key: PDFBOX-2201
>>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2201
>>             Project: PDFBox
>>          Issue Type: Bug
>>          Components: PDModel
>>    Affects Versions: 1.8.5
>>         Environment: Win64
>>            Reporter: Walter Kehl
>>            Priority: Minor
>>         Attachments: Roland_Berger_TAB_Industry_4_0.pdf
>>
>>
>> When accessing a PDF document which clearly has keywords in its meta data, the function call
>>     PDDocumentInformation documentInfo = document.getDocumentInformation();
>>     String info = documentInfo.getKeywords();
>> returns null.
>> Sample PDF is attached.
>
>--
>This message was sent by Atlassian JIRA
>(v6.2#6252)
