[
https://issues.apache.org/jira/browse/PDFBOX-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059752#comment-14059752
]
Tilman Hausherr edited comment on PDFBOX-2201 at 7/12/14 11:49 AM:
-------------------------------------------------------------------
It is more complex: there is the /info segment (the document information dictionary), and then there are two XMP schemas.
The info segment, which is what getDocumentInformation() gives you, looks like this in the (uncompressed) file:
<< /Author (Roland Berger Strategy Consultants) /CreationDate
(D:20140425173834+02'00') /Creator (Adobe InDesign CS6 \(Macintosh\)) /ModDate
(D:20140428101542+02'00') /Producer (Adobe PDF Library 10.0.1) /Subject (New
industrial revolution in Europe; increasing share in the manufacturing
industry) /Title (THINK ACT Industry 4.0 The new industrial revolution \205
How Europe will succeed) /Trapped /False >>
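To make the missing entry easy to see, here is a minimal sketch that lists the name tokens of the dictionary shown above and checks for /Keywords. It is not a PDF parser: the class InfoKeysSketch and its names() helper are made up for illustration, and the scanner only copes with a flat dictionary whose values are literal strings or names (no hex strings, nested dictionaries or #-escapes).

```java
import java.util.ArrayList;
import java.util.List;

class InfoKeysSketch
{
    // The info segment quoted above, copied into a Java string (backslashes doubled).
    static final String INFO =
        "<< /Author (Roland Berger Strategy Consultants) /CreationDate "
        + "(D:20140425173834+02'00') /Creator (Adobe InDesign CS6 \\(Macintosh\\)) /ModDate "
        + "(D:20140428101542+02'00') /Producer (Adobe PDF Library 10.0.1) /Subject (New "
        + "industrial revolution in Europe; increasing share in the manufacturing "
        + "industry) /Title (THINK ACT Industry 4.0 The new industrial revolution \\205 "
        + "How Europe will succeed) /Trapped /False >>";

    // Collects every /Name token (keys and name values alike), skipping over
    // literal strings "( ... )" so that text inside them is never taken for a name.
    static List<String> names(String dict)
    {
        List<String> names = new ArrayList<String>();
        int depth = 0; // nesting depth of unescaped parentheses
        for (int i = 0; i < dict.length(); i++)
        {
            char c = dict.charAt(i);
            if (depth > 0)
            {
                if (c == '\\') { i++; }            // skip the escaped character
                else if (c == '(') { depth++; }
                else if (c == ')') { depth--; }
            }
            else if (c == '(')
            {
                depth = 1;
            }
            else if (c == '/')
            {
                int start = ++i;
                while (i < dict.length() && " \t\r\n/()<>[]".indexOf(dict.charAt(i)) < 0)
                {
                    i++;
                }
                names.add(dict.substring(start, i));
                i--; // let the loop look at the delimiter again
            }
        }
        return names;
    }

    public static void main(String[] args)
    {
        List<String> found = names(INFO);
        System.out.println(found);
        System.out.println("has /Keywords: " + found.contains("Keywords"));
    }
}
```

The point of the sketch is only the last line of main(): the info segment of this file carries no /Keywords name at all.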
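Note that the info segment has no /Keywords entry, which is why getKeywords() returns null.
And there are the schemas, described in the XMP specification:
http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf
The relevant ones are the "Dublin Core Schema" and the "Adobe PDF Schema".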
Acrobat apparently merges all of this for its dialog box. (When the dialog is
first opened, the keywords field is empty; after clicking on the yellow warning,
the keywords appear.)
You can open the compressed file with Notepad++. You won't see the /info
segment, but you will find the schemas; they are named
http://ns.adobe.com/pdf/1.3
http://purl.org/dc/elements/1.1
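As a rough illustration of what sits under those two names, the sketch below reads both schemas from an XMP packet using only the JDK's namespace-aware DOM parser. The packet in SAMPLE is a shortened, hypothetical one built from a few of the keywords of this file, not a copy of the real metadata, and the class and method names are made up. Inside a packet the namespace URIs end with a trailing slash. The sketch assumes pdf:Keywords is written as an element; XMP also allows simple properties as attributes of rdf:Description, which this code would miss.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XmpKeywordSketch
{
    static final String NS_DC = "http://purl.org/dc/elements/1.1/";
    static final String NS_PDF = "http://ns.adobe.com/pdf/1.3/";
    static final String NS_RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Hypothetical, shortened packet (assumption: element form for pdf:Keywords).
    static final String SAMPLE =
        "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
        + "<rdf:RDF xmlns:rdf='" + NS_RDF + "'>"
        + "<rdf:Description rdf:about='' xmlns:dc='" + NS_DC + "' xmlns:pdf='" + NS_PDF + "'>"
        + "<dc:subject><rdf:Bag>"
        + "<rdf:li>Consulting</rdf:li><rdf:li>think act</rdf:li><rdf:li>Europe</rdf:li>"
        + "</rdf:Bag></dc:subject>"
        + "<pdf:Keywords>Consulting, think act, Europe</pdf:Keywords>"
        + "</rdf:Description></rdf:RDF></x:xmpmeta>";

    static Document parse(String xml) throws Exception
    {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed for the ...NS lookups below
        DocumentBuilder builder = dbf.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Dublin Core schema: every rdf:li below dc:subject is one keyword.
    static List<String> dcSubjects(Document doc)
    {
        List<String> result = new ArrayList<String>();
        NodeList subjects = doc.getElementsByTagNameNS(NS_DC, "subject");
        for (int i = 0; i < subjects.getLength(); i++)
        {
            NodeList items = ((Element) subjects.item(i)).getElementsByTagNameNS(NS_RDF, "li");
            for (int j = 0; j < items.getLength(); j++)
            {
                result.add(items.item(j).getTextContent().trim());
            }
        }
        return result;
    }

    // Adobe PDF schema: a single pdf:Keywords string, or null if there is none.
    static String pdfKeywords(Document doc)
    {
        NodeList nodes = doc.getElementsByTagNameNS(NS_PDF, "Keywords");
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception
    {
        Document doc = parse(SAMPLE);
        System.out.println("dc:subject   = " + dcSubjects(doc));
        System.out.println("pdf:Keywords = " + pdfKeywords(doc));
    }
}
```

So the Dublin Core schema holds the keywords as a list, while the Adobe PDF schema holds them as one string.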
To get the keywords, use code like this:
{code}
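// Needs java.io.File, javax.xml.parsers.*, org.w3c.dom.Document,
// org.apache.pdfbox.pdmodel.*, org.apache.pdfbox.pdmodel.common.PDMetadata
// and org.apache.jempbox.xmp.* (the jempbox library).
PDDocument document = PDDocument.loadNonSeq(
        new File("Roland_Berger_TAB_Industry_4_0.pdf"), null);
try
{
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    PDMetadata meta = catalog.getMetadata();
    if (meta != null)
    {
        // Parse the XMP packet stored in the catalog's metadata stream.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder documentBuilder = dbf.newDocumentBuilder();
        Document xmpDocument = documentBuilder.parse(meta.createInputStream());
        XMPMetadata metadata = new XMPMetadata(xmpDocument);
        XMPSchemaDublinCore dc = metadata.getDublinCoreSchema();
        if (dc != null)
        {
            // The Dublin Core "subject" entries are the keywords.
            System.out.println(dc.getSubjects());
        }
    }
}
finally
{
    document.close();
}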
{code}
(Yes, the keywords are named "subjects"!)
With the attached file you get
Roland Berger Strategy Consultants, Consulting, think act, manufacturing
industry, engineered products, Europe
Alternatively, get the file ExtractMetadata.java from the source distribution (which, to
add to the confusion, doesn't use the getSubjects() call).
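Since the keywords can live in three places (the info segment, the Adobe PDF schema and the Dublin Core schema), calling code may want a fallback order. The following is only a sketch of one possible order; it is not what Acrobat or PDFBox does, and resolveKeywords() is a made-up helper that takes values you have already read from the three sources.

```java
import java.util.Arrays;
import java.util.List;

class KeywordFallbackSketch
{
    // Assumed order: info segment first, then pdf:Keywords, then dc:subject.
    // Returns null if none of the three sources has anything.
    static String resolveKeywords(String infoKeywords, String pdfKeywords,
                                  List<String> dcSubjects)
    {
        if (infoKeywords != null && !infoKeywords.trim().isEmpty())
        {
            return infoKeywords.trim();
        }
        if (pdfKeywords != null && !pdfKeywords.trim().isEmpty())
        {
            return pdfKeywords.trim();
        }
        if (dcSubjects != null && !dcSubjects.isEmpty())
        {
            // Dublin Core stores a list; join it into one display string.
            return String.join(", ", dcSubjects);
        }
        return null;
    }

    public static void main(String[] args)
    {
        // Mirrors this issue: nothing in the info segment, subjects in Dublin Core.
        List<String> subjects = Arrays.asList("Consulting", "think act", "Europe");
        System.out.println(resolveKeywords(null, null, subjects));
    }
}
```

In this issue the first argument would be the null returned by documentInfo.getKeywords(), so the result comes from one of the two schemas.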
> getKeywords returns null although keywords are present
> ------------------------------------------------------
>
> Key: PDFBOX-2201
> URL: https://issues.apache.org/jira/browse/PDFBOX-2201
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 1.8.5
> Environment: Win64
> Reporter: Walter Kehl
> Priority: Minor
> Attachments: Roland_Berger_TAB_Industry_4_0.pdf
>
>
> When accessing a PDF document which clearly has keywords in its metadata,
> the function call
> PDDocumentInformation documentInfo = document.getDocumentInformation();
> String info = documentInfo.getKeywords();
> returns null.
> Sample PDF is attached.