I've just added an example that shows how to extract document-level XMP metadata: http://svn.apache.org/viewvc?rev=829357&view=rev
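
For readers who cannot follow the link: the sketch below is not the example committed in that revision. It is a library-independent fallback that assumes the PDF stores its metadata stream unfiltered and unencrypted, which is common for XMP but not guaranteed. It scans the raw file bytes for the XMP packet wrapper (`<?xpacket begin=` ... `<?xpacket end=`) and returns the first packet it finds. The names `XmpPacketScanner` and `findFirstPacket` are hypothetical and not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a PDFBox class): finds an XMP packet by scanning raw bytes.
// Assumption: the metadata stream is neither compressed nor encrypted.
final class XmpPacketScanner {
    private static final String BEGIN = "<?xpacket begin=";
    private static final String END = "<?xpacket end=";

    private XmpPacketScanner() {}

    /** Returns the first XMP packet (wrapper included) found in data, or null if none. */
    static String findFirstPacket(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices
        // correspond to byte offsets and no byte is lost in the conversion.
        String text = new String(data, StandardCharsets.ISO_8859_1);

        int start = text.indexOf(BEGIN);
        if (start < 0) {
            return null; // no packet header in the file
        }
        int endMarker = text.indexOf(END, start + BEGIN.length());
        if (endMarker < 0) {
            return null; // header without trailer: treat as not found
        }
        int close = text.indexOf("?>", endMarker);
        if (close < 0) {
            return null; // trailer processing instruction is not terminated
        }

        String latin1Packet = text.substring(start, close + 2);
        // Assumption: the packet itself is UTF-8 encoded, so re-decode the bytes.
        return new String(latin1Packet.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }
}
```

A caller would read the file into a byte array (for instance with `java.nio.file.Files.readAllBytes`) and pass it to `findFirstPacket`. The first packet in the file is not necessarily the one attached to the document catalog, since pages and XObjects may carry their own packets; within PDFBox, `document.getDocumentCatalog().getMetadata()` is the more reliable way to reach the catalog's stream.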
As Andreas noted, PDF supports attaching metadata to many different objects
(including pages, XObjects, fonts etc.). The most interesting packet will
certainly be the one attached to the document catalog.

I hope the new example will help you solve your requirement, Robin.

On 23.10.2009 12:35:17 Andreas Lehmkühler wrote:
> Hi,
>
> Sent: Thu, 22 Oct 2009 From: Robin Diederen<diede...@nlcom.nl>
>
> > Hi,
> >
> > Thanks for looking into the code; I'm a bit confused though. I guess it's
> > your suggestion to inspect the three locations for metadata "by hand"? What
> > would be the best way to proceed?
> As I've already said, I'm not an XMP expert; I just try to find possible
> locations where metadata are used within pdfbox.
>
> PDPage-metadata:
> - load the document
> - get all pages calling document.getDocumentCatalog().getAllPages()
> - iterate through all pages and check them for metadata calling getMetadata()
>
> PDXObject:
> - load the document
> - get all pages calling document.getDocumentCatalog().getAllPages()
> - iterate through all pages and get all XObjects by calling getXObjects()
> - iterate through all XObjects and check them for metadata calling
>   getMetadata()
>
> I don't know if that really works, but give it a try.
>
> BR
> Andreas Lehmkühler
>
>
> > Best, Robin
> >
> > -----Original message-----
> > From: Andreas Lehmkühler <andr...@lehmi.de>
> > Sent: Thu 22-10-2009 22:36
> > To: pdfbox-users@incubator.apache.org;
> > Subject: Re: Question XMP metadata extraction
> >
> >
> > Robin Diederen wrote:
> > > Andreas,
> > >
> > > According to the JavaDoc
> > > (http://www.pdfbox.org/javadoc/org/pdfbox/pdmodel/common/PDMetadata.html#PDMetadata%28org.pdfbox.pdmodel.PDDocument%29)
> > > the extractxmpmetadata method should be able to do this. Or am I
> > > missing something?
> > Ok, I had a deeper look and it seems that there are 3 supported
> > locations for metadata within pdfbox: PDDocumentCatalog, PDPage and
> > PDXObject. The "classic" metadata are located in the catalog. Perhaps
> > you will find the metadata you are looking for in the two other objects?
> >
> > BR
> > Andreas Lehmkühler
> >
> > > Thanks for your help, greatly appreciated!
> > >
> > > Best, Robin
> > >
> > > -----Original message-----
> > > From: Andreas Lehmkühler <andr...@lehmi.de>
> > > Sent: Thu 22-10-2009 22:09
> > > To: pdfbox-users@incubator.apache.org;
> > > Subject: Re: Question XMP metadata extraction
> > >
> > > Hi,
> > >
> > > Robin Diederen wrote:
> > >> Hello Andreas,
> > >>
> > >> I did have a look at the PrintDocumentMetaData.java file; there I find
> > >> that using the PDDocumentInformation metadata is extracted. This code is
> > >> useful for PDF files with "classic" metadata, but not for PDF files only
> > >> carrying XMP metadata, right?
> > > OK, I see. I'm not that familiar with the XMP stuff, but I guess I
> > > understand your problem.
> > >
> > >> There's my issue.. as soon as I have a PDF file with only XMP metadata I
> > >> need some other way to extract this metadata..
> > > I'm afraid that pdfbox is still limited to the handling of "classic"
> > > metadata.
> > >
> > >
> > >> Best, Robin
> > >>
> > >> -----Original message-----
> > >> From: Andreas Lehmkühler <andr...@lehmi.de>
> > >> Sent: Thu 22-10-2009 21:47
> > >> To: pdfbox-users@incubator.apache.org;
> > >> Subject: Re: Question XMP metadata extraction
> > >>
> > >> Hi,
> > >>
> > >> Robin Diederen wrote:
> > >>> Hello all,
> > >>>
> > >>> I'm quite new to PDFbox and currently figuring out how to extract
> > >>> metadata from PDF files which is in XMP format.
> > >>>
> > >>> I have a few files containing XMP metadata, but I can not get any of
> > >>> those to work. And I can't seem to figure out where I am failing.
> > >>>
> > >>> A code snippet (all non-relevant code was deleted):
> > >>>
> > >>> String inputFile = "/some/file.pdf"
> > >>>
> > >>> PDDocument pdfDocument = null;
> > >>> pdfDocument = new PDDocument();
> > >>> pdfDocument = PDDocument.load(inputFile);
> > >>> PDMetadata pdfMetaData = new PDMetadata(pdfDocument);
> > >>>
> > >>> int metadataLength = pdfMetaData.getLength();
> > >>> System.out.println(pdfMetaData.getLength());
> > >>>
> > >>>
> > >>> pdfMetaData.exportXMPMetadata();
> > >>>
> > >>>
> > >>> The getLength call always returns 0; the exportXMPMetadata call returns
> > >>> an error:
> > >>>
> > >>> [Fatal Error] :-1:-1: Premature end of file.
> > >>> Exception in thread "main" java.io.IOException: Premature end of file.
> > >>>     at org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:78)
> > >>>     at org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:554)
> > >>>     at org.apache.pdfbox.pdmodel.common.PDMetadata.exportXMPMetadata(PDMetadata.java:86)
> > >>>     at com.robindiederen.pdf.Extractor.extractMetaDataFromXMP(Extractor.java:124)
> > >>>     at com.robindiederen.pdf.Extractor.main(Extractor.java:90)
> > >>>
> > >>>
> > >>> This happens for every PDF I test. Extracting metadata from the
> > >>> DocumentInformation table works like a charm. I'm using PDFbox 0.80 on
> > >>> Java 1.5.
> > >> Have a look at PrintDocumentMetaData as an example of how to extract the
> > >> doc's metadata.
> > >>
> > >> HTH
> > >> Andreas Lehmkühler
> > >>
> > > BR
> > > Andreas Lehmkühler
> > >
>
> --- end of original message ----

Jeremias Maerki