Re: Question XMP metadata extraction

Jeremias Maerki Sat, 24 Oct 2009 05:52:10 -0700

I've just added an example that shows how to extract document-level XMP
metadata:
http://svn.apache.org/viewvc?rev=829357&view=rev


As Andreas noted, PDF supports attaching metadata to many different
objects (including pages, XObjects, fonts etc.). The most interesting
packet will certainly be that attached to the document catalog. I hope
the new example will help you meet your requirement, Robin.
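
A side note on the "Premature end of file" exception in the original
snippet (quoted at the bottom of this mail): getLength() returning 0
suggests that the stream handed to the XML parser was empty, presumably
because new PDMetadata(pdfDocument) creates a fresh metadata stream
instead of reading the one stored in the file. That is my reading of the
constructor, not something confirmed in this thread. The parser
behaviour itself can be reproduced with the JDK alone. The following is
only a minimal sketch; the class and method names are made up for
illustration and are not part of PDFBox or JempBox:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of PDFBox or JempBox.
class EmptyXmlParseDemo {

    // Returns true if the bytes parse as XML, false if the parser rejects them.
    static boolean parses(byte[] xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return true;
        } catch (SAXException e) {
            // Zero bytes of input end up here: there is no root element to read.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for in-memory input; surfaced as an unchecked error.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] empty = new byte[0];
        byte[] minimal = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("empty stream parses:   " + parses(empty));
        System.out.println("minimal packet parses: " + parses(minimal));
    }
}
```

If that reading is right, the fix is to ask the document catalog for its
existing metadata object rather than constructing a new one.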

On 23.10.2009 12:35:17 Andreas Lehmkühler wrote:
> Hi,
> 
> Sent: Thu, 22 Oct 2009 From: Robin Diederen <diede...@nlcom.nl>
> 
> > Hi,
> > 
> > Thanks for looking into the code; I'm a bit confused though. I guess it's
> > your suggestion to inspect the three locations for metadata "by hand"?  What
> > would be the best way to proceed?
> As I've already said, I'm not an XMP expert; I'm just trying to find the
> possible locations where metadata is used within pdfbox.
> 
> PDPage-metadata:
> - load the document
> - get all pages calling document.getDocumentCatalog().getAllPages()
> - iterate through all pages and check them for metadata calling getMetadata()
> 
> PDXObject:
> - load the document
> - get all pages calling document.getDocumentCatalog().getAllPages()
> - iterate through all pages and get all XObjects by calling getXObjects()
> - iterate through all XObjects and check them for metadata calling 
> getMetadata()
> 
> I don't know if that really works, but give it a try.
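
A cruder cross-check, independent of PDFBox and of which object carries
the packet: XMP packets are wrapped in <?xpacket begin=...?> and
<?xpacket end=...?> processing instructions so that they can be found by
scanning the raw file. The sketch below does that with the JDK only. It
assumes the metadata streams are stored unfiltered and unencrypted and
that the packets are UTF-8; the class and method names are invented for
this example. It is not what PDFBox does internally, and it cannot tell
you which page or XObject a packet belongs to.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: scans raw bytes for XMP packets.
class XmpPacketScan {

    private static final String BEGIN = "<?xpacket begin=";
    private static final String END = "<?xpacket end=";

    // Returns every span from a begin marker to the "?>" closing its end marker.
    static List<String> findPackets(byte[] data) {
        // ISO-8859-1 maps each byte to exactly one char, so binary content
        // elsewhere in the file cannot break the search.
        String raw = new String(data, StandardCharsets.ISO_8859_1);
        List<String> packets = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = raw.indexOf(BEGIN, from);
            if (start < 0) break;
            int endMarker = raw.indexOf(END, start);
            if (endMarker < 0) break;
            int close = raw.indexOf("?>", endMarker);
            if (close < 0) break;
            // Recover the original bytes and decode them as UTF-8 (assumption).
            byte[] packet = raw.substring(start, close + 2)
                    .getBytes(StandardCharsets.ISO_8859_1);
            packets.add(new String(packet, StandardCharsets.UTF_8));
            from = close + 2;
        }
        return packets;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java XmpPacketScan.java <file.pdf>");
            return;
        }
        List<String> packets = findPackets(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(packets.size() + " packet(s) found");
        for (String packet : packets) {
            System.out.println(packet);
        }
    }
}
```

If this reports no packets for a file that is known to carry XMP, the
metadata streams are probably compressed or encrypted, and walking the
objects through PDFBox as described above is the way to go.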
> 
> BR 
> Andreas Lehmkühler
> > 
> > Best, Robin
> >  
> > -----Original message-----
> > From: Andreas Lehmkühler <andr...@lehmi.de>
> > Sent: Thu 22-10-2009 22:36
> > To: pdfbox-users@incubator.apache.org; 
> > Subject: Re: Question XMP metadata extraction
> > 
> > 
> > Robin Diederen wrote:
> > > Andreas,
> > > 
> > > According to the JavaDoc
> > > (http://www.pdfbox.org/javadoc/org/pdfbox/pdmodel/common/PDMetadata.html#PDMetadata%28org.pdfbox.pdmodel.PDDocument%29)
> > > the exportXMPMetadata method should be able to do this. Or am I missing
> > > something?
> > Ok, I had a deeper look and it seems that there are 3 supported
> > locations for metadata within pdfbox: PDDocumentCatalog, PDPage and
> > PDXObject. The "classic" metadata are located in the catalog. Perhaps
> > you will find the metadata you are looking for in the two other objects?
> > 
> > BR
> > Andreas Lehmkühler
> > 
> > > Thanks for your help, greatly appreciated!
> > > 
> > >  
> > > 
> > > Best, Robin
> > >  
> > > -----Original message-----
> > > From: Andreas Lehmkühler <andr...@lehmi.de>
> > > Sent: Thu 22-10-2009 22:09
> > > To: pdfbox-users@incubator.apache.org; 
> > > Subject: Re: Question XMP metadata extraction
> > > 
> > > Hi,
> > > 
> > > Robin Diederen wrote:
> > >> Hello Andreas,
> > >>
> > >> I did have a look at the PrintDocumentMetaData.java file; there I see that
> > >> the metadata is extracted via PDDocumentInformation. This code is useful
> > >> for PDF files with "classic" metadata, but not for PDF files carrying only
> > >> XMP metadata, right?
> > > OK, I see. I'm not that familiar with the XMP stuff, but I guess I
> > > understand your problem.
> > > 
> > >> That's my issue: as soon as I have a PDF file with only XMP metadata, I
> > >> need some other way to extract it.
> > > I'm afraid that pdfbox is still limited to handling "classic" metadata.
> > > 
> > > 
> > >> Best, Robin
> > >>  
> > >> -----Original message-----
> > >> From: Andreas Lehmkühler <andr...@lehmi.de>
> > >> Sent: Thu 22-10-2009 21:47
> > >> To: pdfbox-users@incubator.apache.org; 
> > >> Subject: Re: Question XMP metadata extraction
> > >>
> > >> Hi,
> > >>
> > >> Robin Diederen wrote:
> > >>> Hello all,
> > >>>
> > >>> I'm quite new to PDFBox and am currently figuring out how to extract
> > >>> XMP-format metadata from PDF files.
> > >>>
> > >>> I have a few files containing XMP metadata, but I cannot get any of
> > >>> them to work, and I can't figure out where I am going wrong.
> > >>>
> > >>> A code snippet (all non-relevant code was deleted):
> > >>>
> > >>> String inputFile = "/some/file.pdf";
> > >>>
> > >>> PDDocument pdfDocument = null;
> > >>> pdfDocument = new PDDocument();
> > >>> pdfDocument = PDDocument.load(inputFile);     
> > >>> PDMetadata pdfMetaData = new PDMetadata(pdfDocument);
> > >>>             
> > >>> int metadataLength = pdfMetaData.getLength();
> > >>> System.out.println(pdfMetaData.getLength());
> > >>>  
> > >>>
> > >>> pdfMetaData.exportXMPMetadata();
> > >>>  
> > >>>
> > >>> The getLength call always returns 0; the exportXMPMetadata call returns
> > >>> an error:
> > >>>
> > >>> [Fatal Error] :-1:-1: Premature end of file.
> > >>> Exception in thread "main" java.io.IOException: Premature end of file.
> > >>>     at org.apache.jempbox.impl.XMLUtil.parse(XMLUtil.java:78)
> > >>>     at org.apache.jempbox.xmp.XMPMetadata.load(XMPMetadata.java:554)
> > >>>     at org.apache.pdfbox.pdmodel.common.PDMetadata.exportXMPMetadata(PDMetadata.java:86)
> > >>>     at com.robindiederen.pdf.Extractor.extractMetaDataFromXMP(Extractor.java:124)
> > >>>     at com.robindiederen.pdf.Extractor.main(Extractor.java:90)
> > >>>
> > >>>  
> > >>>
> > >>> This happens for every PDF I test. Extracting metadata from the
> > >>> DocumentInformation table works like a charm. I'm using PDFBox 0.80 on
> > >>> Java 1.5.
> > >> Have a look at PrintDocumentMetaData as an example of how to extract the
> > >> document's metadata.
> > >>
> > >> HTH
> > >> Andreas Lehmkühler
> > >>
> > >>
> > > BR
> > > Andreas Lehmkühler
> > > 
> > > 
> > > 
> > 
> > 
> > 
> 
> --- End of original message ---




Jeremias Maerki
