RE: Question XMP metadata extraction

Robin Diederen Tue, 27 Oct 2009 13:16:17 -0700

Hi Jeremias and Andreas,

Thanks for the support; I've been toying quite a bit with PDFbox and I'm 
finally getting somewhere. I did not know about the objects / schemas which can 
contain metadata information (heck two weeks ago I didn't know of XMP metadata 
;-)).


Anyhow, I am finally getting some results. For this specific project I was 
after the metadata which Adobe Reader shows as "description". I learned, by 
printing the raw XMP metadata, that this piece of information was stored in the 
Dublin Core schema. Nice as that is, calling the getDescription method on 
the Dublin Core schema did not give me any results. However, by calling 
getTextProperty("dc:description"), I get all the data I am after. 

I do not have any clue why the getDescription call does not return anything, but 
I guess it's a bug of some kind. 
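
A minimal sketch of reading dc:description straight from the raw XMP packet, 
using only the JDK's DOM parser rather than PDFBox or JempBox. It assumes the 
property is stored either as a language alternative (rdf:Alt with rdf:li 
entries) or as plain text inside the element. Whether that difference is what 
makes getDescription come back empty is only a guess and has not been checked 
against the JempBox source. The class name XmpDescription and the sample 
packets are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of PDFBox or JempBox.
public class XmpDescription {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the dc:description text from a raw XMP packet, or null if absent. */
    public static String extract(String xmpPacket) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look elements up by namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xmpPacket)));

        NodeList descriptions = doc.getElementsByTagNameNS(DC_NS, "description");
        if (descriptions.getLength() == 0) {
            return null;
        }
        Element description = (Element) descriptions.item(0);

        // Assumed form 1: language alternative, rdf:Alt containing rdf:li entries.
        // The first entry is taken; no xml:lang matching is attempted.
        NodeList items = description.getElementsByTagNameNS(RDF_NS, "li");
        if (items.getLength() > 0) {
            return items.item(0).getTextContent().trim();
        }
        // Assumed form 2: plain text directly inside dc:description.
        return description.getTextContent().trim();
    }

    /** Builds a made-up sample packet around the given dc:description body. */
    static String samplePacket(String descriptionBody) {
        return "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
             + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
             + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
             + "<dc:description>" + descriptionBody + "</dc:description>"
             + "</rdf:Description></rdf:RDF></x:xmpmeta>";
    }

    public static void main(String[] args) throws Exception {
        String asAlt = samplePacket(
            "<rdf:Alt><rdf:li xml:lang=\"x-default\">Quarterly report</rdf:li></rdf:Alt>");
        String asText = samplePacket("Quarterly report");
        System.out.println(extract(asAlt));  // Quarterly report
        System.out.println(extract(asText)); // Quarterly report
    }
}
```

The sketch does not handle the RDF shorthand where dc:description appears as 
an attribute of rdf:Description, and it does not strip an xpacket wrapper 
with trailing padding; both would need extra handling in real files.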

Thanks for all your help!

Best, Robin

-----Original message-----
From: Jeremias Maerki [mailto:d...@jeremias-maerki.ch] 
Sent: Saturday 24 October 2009 14:52
To: pdfbox-users@incubator.apache.org
Subject: Re: Question XMP metadata extraction

I've just added an example that shows how to extract document-level XMP
metadata:
http://svn.apache.org/viewvc?rev=829357&view=rev

As Andreas noted, PDF supports attaching metadata to many different objects 
(including pages, XObjects, fonts etc.). The most interesting packet will 
certainly be that attached to the document catalog. I hope the new example will 
help you solve your requirement, Robin.

On 23.10.2009 12:35:17 Andreas Lehmkühler wrote:
> Hi,
> 
> Sent: Thu, 22 Oct 2009 From: Robin Diederen<diede...@nlcom.nl>
> 
> > Hi,
> > 
> > Thanks for looking into the code; I'm a bit confused though. I guess 
> > it's your suggestion to inspect the three locations for metadata "by 
> > hand"?  What would be the best way to proceed?
> As I've already said, I'm not an XMP expert; I just try to find possible 
> locations where metadata are used within pdfbox.
> 
> PDPage-metadata:
> - load the document
> - get all pages calling document.getDocumentCatalog().getAllPages()
> - iterate through all pages and check them for metadata calling 
> getMetadata()
> 
> PDXObject:
> - load the document
> - get all pages calling document.getDocumentCatalog().getAllPages()
> - iterate through all pages and get all XObjects by calling 
> getXObjects()
> - iterate through all XObjects and check them for metadata calling 
> getMetadata()
> 
> I don't know if that really works, but give it a try.
> 
> BR
> Andreas Lehmkühler
> > 
> > Best, Robin
> >  
> > -----Original message-----
> > From: Andreas Lehmkühler <andr...@lehmi.de>
> > Sent: Thu 22-10-2009 22:36
> > To: pdfbox-users@incubator.apache.org;
> > Subject: Re: Question XMP metadata extraction
> > 
> > 
> > Robin Diederen wrote:
> > > Andreas,
> > > 
> > > According to the JavaDoc
> > > (http://www.pdfbox.org/javadoc/org/pdfbox/pdmodel/common/PDMetadata.html#PDMetadata%28org.pdfbox.pdmodel.PDDocument%29)
> > > the exportXMPMetadata method should be able to do this. Or am I
> > > missing something?
> > Ok, I had a deeper look and it seems that there are 3 supported 
> > locations for metadata within pdfbox: PDDocumentCatalog, PDPage and 
> > PDXObject. The "classic" metadata are located in the catalog. 
> > Perhaps you will find the metadata you are looking for in the two other 
> > objects?
> > 
> > BR
> > Andreas Lehmkühler
> > 
> > > Thanks for your help, greatly appreciated!
> > > 
> > >  
> > > 
> > > Best, Robin
> > >  
> > > -----Original message-----
> > > From: Andreas Lehmkühler <andr...@lehmi.de>
> > > Sent: Thu 22-10-2009 22:09
> > > To: pdfbox-users@incubator.apache.org;
> > > Subject: Re: Question XMP metadata extraction
> > > 
> > > Hi,
> > > 
> > > Robin Diederen wrote:
> > >> Hello Andreas,
> > >>
> > >> I did have a look at the PrintDocumentMetaData.java file; there I find
> > >> that the metadata is extracted using PDDocumentInformation. This code
> > >> is useful for PDF files with "classic" metadata, but not for PDF files
> > >> carrying only XMP metadata, right?
> > > OK, I see. I'm not that familiar with the XMP stuff, but I guess I 
> > > understand your problem.
> > > 
> > >> There's my issue: as soon as I have a PDF file with only XMP metadata,
> > >> I need some other way to extract this metadata.
> > > I'm afraid that pdfbox is still limited to the handling of "classic"
> > > metadata.
> > > 
> > > 
> > >> Best, Robin
> > >>  
> > >> -----Original message-----
> > >> From: Andreas Lehmkühler <andr...@lehmi.de>
> > >> Sent: Thu 22-10-2009 21:47
> > >> To: pdfbox-users@incubator.apache.org;
> > >> Subject: Re: Question XMP metadata extraction
> > >>
> > >> Hi,
> > >>
> > >> Robin Diederen wrote:
> > >>> Hello all,
> > >>>
> > >>> I'm quite new to PDFbox and currently figuring out how to extract
> > >>> metadata in XMP format from PDF files.
> > >>>
> > >>> I have a few files containing XMP metadata, but I cannot get any of
> > >>> those to work. And I can't seem to figure out where I am failing.
> > >>>
> > >>> A code snippet (all non-relevant code was deleted):
> > >>>
> > >>> String inputFile = "/some/file.pdf";
> > >>>
> > >>> PDDocument pdfDocument = null;
> > >>> pdfDocument = new PDDocument();
> > >>> pdfDocument = PDDocument.load(inputFile);
> > >>> PDMetadata pdfMetaData = new PDMetadata(pdfDocument);
> > >>>             
> > >>> int metadataLength = pdfMetaData.getLength(); 
> > >>> System.out.println(pdfMetaData.getLength());
> > >>>  
> > >>>
> > >>> pdfMetaData.exportXMPMetadata();
> > >>>  
> > >>>
> > >>> The getLength call always returns 0; the exportXMPMetadata call
> > >>> returns an error:
> > >>>
> > >>> [Fatal Error] :-1:-1: Premature end of file.
> > >>> Exception in thread "main" java.io.IOException: Premature end of file.
> > >>>     at org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:78)
> > >>>     at org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:554)
> > >>>     at org.apache.pdfbox.pdmodel.common.PDMetadata.exportXMPMetadata(PDMetadata.java:86)
> > >>>     at com.robindiederen.pdf.Extractor.extractMetaDataFromXMP(Extractor.java:124)
> > >>>     at com.robindiederen.pdf.Extractor.main(Extractor.java:90)
> > >>>
> > >>>  
> > >>>
> > >>> This happens for every PDF I test. Extracting metadata from the
> > >>> DocumentInformation table works like a charm. I'm using PDFbox 0.80 on
> > >>> Java 1.5.
> > >> Have a look at PrintDocumentMetaData as an example of how to extract 
> > >> the document's metadata.
> > >>
> > >> HTH
> > >> Andreas Lehmkühler
> > >>
> > >>
> > > BR
> > > Andreas Lehmkühler
> > > 
> > > 
> > > 
> > 
> > 
> > 
> 
> --- end of original message ----




Jeremias Maerki
