Robin, could you just post the XMP packet (or the full PDF if that's easier) here? That way we could test why this happens.
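One way to look at the packet in the meantime is to parse the raw XMP with the JDK's own XML APIs and check how dc:description is encoded. The sketch below is not PDFBox or JempBox code. The class name DcDescriptionProbe, the method describe and the sample packet are invented for illustration. It assumes the value sits either directly inside the dc:description element or inside an rdf:Alt list, and it does not cover the attribute shorthand form.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of PDFBox/JempBox.
public class DcDescriptionProbe {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns "alt: <text>", "plain: <text>" or "absent" for the given packet. */
    static String describe(String xmpPacket) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xmpPacket)));

            NodeList descriptions = doc.getElementsByTagNameNS(DC_NS, "description");
            if (descriptions.getLength() == 0) {
                return "absent";
            }
            Element description = (Element) descriptions.item(0);
            // Language alternative: dc:description > rdf:Alt > rdf:li
            NodeList items = description.getElementsByTagNameNS(RDF_NS, "li");
            if (items.getLength() > 0) {
                return "alt: " + items.item(0).getTextContent().trim();
            }
            // Otherwise treat the element's own text as the value.
            return "plain: " + description.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample packet; replace with the XMP printed from the PDF.
        String sample =
              "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
            + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
            + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
            + "<dc:description><rdf:Alt>"
            + "<rdf:li xml:lang=\"x-default\">Sample text</rdf:li>"
            + "</rdf:Alt></dc:description>"
            + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(describe(sample));
    }
}
```

Running it against the packet printed from the PDF would show which of the two encodings is in use, which might help narrow down the getDescription behaviour.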
On 27.10.2009 21:15:49 Robin Diederen wrote:
> Hi Jeremias and Andreas,
>
> Thanks for the support; I've been toying quite a bit with PDFbox and I'm
> finally getting somewhere. I did not know about the objects / schemas which
> can contain metadata information (heck, two weeks ago I didn't know of XMP
> metadata ;-)).
>
> Anyhow.. I finally am getting some results. For this specific project I was
> after the metadata which Adobe Reader shows as "description". I learned, by
> printing the raw XMP metadata, that this piece of information was stored in
> the Dublin Core schema. Nice as that is, by using the getDescription method
> from the Dublin Core schema, I did not get any results. However, by using
> getTextProperty("dc:description"), I get all the data I am after.
>
> I do not have any clue why the getDescription call does not return anything
> but I guess it's a bug of some kind.
>
> Thanks for all your help!
>
> Best, Robin
>
> -----Original message-----
> From: Jeremias Maerki [mailto:d...@jeremias-maerki.ch]
> Sent: Saturday, 24 October 2009 14:52
> To: pdfbox-users@incubator.apache.org
> Subject: Re: Question XMP metadata extraction
>
> I've just added an example that shows how to extract document-level XMP
> metadata:
> http://svn.apache.org/viewvc?rev=829357&view=rev
>
> As Andreas noted, PDF supports attaching metadata to many different objects
> (including pages, XObjects, fonts etc.). The most interesting packet will
> certainly be that attached to the document catalog. I hope the new example
> will help you solve your requirement, Robin.
>
> On 23.10.2009 12:35:17 Andreas Lehmkühler wrote:
> > Hi,
> >
> > Sent: Thu, 22 Oct 2009 From: Robin Diederen <diede...@nlcom.nl>
> >
> > > Hi,
> > >
> > > Thanks for looking into the code; I'm a bit confused though. I guess
> > > it's your suggestion to inspect the three locations for metadata "by
> > > hand"? What would be the best way to proceed?
> > As I've already said, I'm not an XMP expert; I just try to find possible
> > locations where metadata are used within pdfbox.
> >
> > PDPage-metadata:
> > - load the document
> > - get all pages calling document.getDocumentCatalog().getAllPages()
> > - iterate through all pages and check them for metadata calling
> >   getMetadata()
> >
> > PDXObject:
> > - load the document
> > - get all pages calling document.getDocumentCatalog().getAllPages()
> > - iterate through all pages and get all XObjects by calling
> >   getXObjects()
> > - iterate through all XObjects and check them for metadata calling
> >   getMetadata()
> >
> > I don't know if that really works, but give it a try.
> >
> > BR
> > Andreas Lehmkühler
> > >
> > > Best, Robin
> > >
> > > -----Original message-----
> > > From: Andreas Lehmkühler <andr...@lehmi.de>
> > > Sent: Thu 22-10-2009 22:36
> > > To: pdfbox-users@incubator.apache.org;
> > > Subject: Re: Question XMP metadata extraction
> > >
> > >
> > > Robin Diederen wrote:
> > > > Andreas,
> > > >
> > > > According to the JavaDoc
> > > > (http://www.pdfbox.org/javadoc/org/pdfbox/pdmodel/common/PDMetadata.html#PDMetadata%28org.pdfbox.pdmodel.PDDocument%29)
> > > > the exportXMPMetadata method should be able to do this. Or am I
> > > > missing something?
> > > Ok, I had a deeper look and it seems that there are 3 supported
> > > locations for metadata within pdfbox: PDDocumentCatalog, PDPage and
> > > PDXObject. The "classic" metadata are located in the catalog.
> > > Perhaps you will find the metadata you are looking for in the two other
> > > objects?
> > >
> > > BR
> > > Andreas Lehmkühler
> > >
> > > > Thanks for your help, greatly appreciated!
> > > >
> > > > Best, Robin
> > > >
> > > > -----Original message-----
> > > > From: Andreas Lehmkühler <andr...@lehmi.de>
> > > > Sent: Thu 22-10-2009 22:09
> > > > To: pdfbox-users@incubator.apache.org;
> > > > Subject: Re: Question XMP metadata extraction
> > > >
> > > > Hi,
> > > >
> > > > Robin Diederen wrote:
> > > >> Hello Andreas,
> > > >>
> > > >> I did have a look at the PrintDocumentMetaData.java file; there I find
> > > >> that the metadata is extracted using PDDocumentInformation. This code
> > > >> is useful for PDF files with "classic" metadata, but not for PDF files
> > > >> only carrying XMP metadata, right?
> > > > OK, I see. I'm not that familiar with the XMP stuff, but I guess I
> > > > understand your problem.
> > > >
> > > >> There's my issue.. as soon as I have a PDF file with only XMP metadata
> > > >> I need some other way to extract this metadata..
> > > > I'm afraid that pdfbox is still limited to the handling of "classic"
> > > > metadata.
> > > >
> > > >> Best, Robin
> > > >>
> > > >> -----Original message-----
> > > >> From: Andreas Lehmkühler <andr...@lehmi.de>
> > > >> Sent: Thu 22-10-2009 21:47
> > > >> To: pdfbox-users@incubator.apache.org;
> > > >> Subject: Re: Question XMP metadata extraction
> > > >>
> > > >> Hi,
> > > >>
> > > >> Robin Diederen wrote:
> > > >>> Hello all,
> > > >>>
> > > >>> I'm quite new to PDFbox and currently figuring out how to extract
> > > >>> metadata from PDF files which is in XMP format.
> > > >>>
> > > >>> I have a few files containing XMP metadata, but I can not get any of
> > > >>> those to work. And I can't seem to figure out where I am failing.
> > > >>>
> > > >>> A code snippet (all non-relevant code was deleted):
> > > >>>
> > > >>> String inputFile = "/some/file.pdf";
> > > >>>
> > > >>> PDDocument pdfDocument = null;
> > > >>> pdfDocument = new PDDocument();
> > > >>> pdfDocument = PDDocument.load(inputFile);
> > > >>> PDMetadata pdfMetaData = new PDMetadata(pdfDocument);
> > > >>>
> > > >>> int metadataLength = pdfMetaData.getLength();
> > > >>> System.out.println(pdfMetaData.getLength());
> > > >>>
> > > >>> pdfMetaData.exportXMPMetadata();
> > > >>>
> > > >>> The getLength call always returns 0; the exportXMPMetadata call
> > > >>> returns an error:
> > > >>>
> > > >>> [Fatal Error] :-1:-1: Premature end of file.
> > > >>> Exception in thread "main" java.io.IOException: Premature end of file.
> > > >>>     at org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:78)
> > > >>>     at org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:554)
> > > >>>     at org.apache.pdfbox.pdmodel.common.PDMetadata.exportXMPMetadata(PDMetadata.java:86)
> > > >>>     at com.robindiederen.pdf.Extractor.extractMetaDataFromXMP(Extractor.java:124)
> > > >>>     at com.robindiederen.pdf.Extractor.main(Extractor.java:90)
> > > >>>
> > > >>> This happens for every PDF I test. Extracting metadata from the
> > > >>> DocumentInformation table works like a charm. I'm using PDFbox 0.80 on
> > > >>> Java 1.5.
> > > >> Have a look at PrintDocumentMetaData as an example of how to extract
> > > >> the document's metadata.
> > > >>
> > > >> HTH
> > > >> Andreas Lehmkühler
> > > >>
> > > > BR
> > > > Andreas Lehmkühler
> > > >
> > --- End of original message ---
> >
> Jeremias Maerki
>

Jeremias Maerki