To clarify... the 'we' in my third sentence was referring to Alfresco, not Tika.
I'm not sure how much of that code would be useful but I may be able to contribute some of it.

Regards,

Ray

> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B. <[email protected]> wrote:
>
> Thank you. Will take a look.
>
> -----Original Message-----
> From: Ray Gauss [mailto:[email protected]]
> Sent: Tuesday, March 08, 2016 1:55 PM
> To: [email protected]
> Subject: Re: [DISCUSS] options for XMP parsing?
>
> Hi Tim,
>
> We're already using Adobe's xmpcore in tika-xmp which works fine for parsing
> XMP (though has not seen updates in a while), but getting the XMP packets out
> of the files is trickier.
>
> We have XMPPacketScanner which works for many cases, but not all. InDesign
> files for example do some strange things.
>
> In the past we've used different packet scanners depending on the file type
> (including Exiftool command-line) to get the XMP out then used xmpcore to
> parse into simple flattened properties.
>
> Regards,
>
> Ray
>
>
>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B. <[email protected]> wrote:
>>
>> All,
>>
>> PDFBox 2.0 is soon to be released. In the course of its development, the
>> project has migrated from Jempbox (which we're now using) to XmpBox; and
>> Jempbox is now on its last legs.
>>
>> XmpBox was "written for PDF/A checking," not for robust processing of common
>> variants of XMPs in the wild; I found that it fails on roughly 40% of XMPs I
>> pulled out of PDFs from govdocs1/commoncrawl.
>>
>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>>
>> Has anyone had any luck with an Apache-friendly XMP parser? Are there
>> better options than copying and pasting jempbox into Tika and maintaining it
>> ourselves (yuk!)?
>>
>> Best,
>>
>> Tim
>>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:[email protected]]
>> Sent: Tuesday, March 08, 2016 12:13 PM
>> To: [email protected]
>> Subject: Re: roadmap for XMPBox?
>>
>> I think the problem is that XmpBox was written for PDF/A checking, so it
>> fails with XMPs that are not PDF/A. For example, file 000142.pdf has the
>> schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>>
>> And no, there are no plans for anything on XMP at this time...
>>
>> Tilman
>>
>>
>> Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.:
>>> All,
>>>
>>> When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch
>>> from our current reliance on jempbox to XMPBox. I recently extracted ~70k
>>> XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser,
>>> there were exceptions on roughly 40% of the XMPs.
>>>
>>> I’m including a table below of the counts of exception messages. Are
>>> there any plans to make XMPBox more lenient or is this what we can expect
>>> going forward?
>>>
>>> As always, I’m more than happy to help with files and tests. Let me know
>>> what I can do.
>>>
>>> Cheers,
>>>
>>> Tim
>>>
>>> No XmpParsingException on 42,022 files.
>>>
>>> Exceptions:
>>>
>>>  Count  Exception message
>>>  -----  -----------------
>>>  13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
>>>   3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>   3113  Missing pdfaSchema:property in type definition
>>>   2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>>>    927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>>>    723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
>>>    710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
>>>    654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
>>>    522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>>>    492  Failed to parse
>>>    370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
>>>    262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
>>>    188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
>>>    144  Failed to instanciate property in xmp:CreateDate
>>>    125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
>>>     94  Expecting local name 'xmpmeta' and found 'xapmeta'
>>>     84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
>>>     74  Failed to instanciate property in xap:CreateDate
>>>     68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
>>>     49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
>>>     46  Cannot find a definition for the namespace http://www.sap.com
>>>     33  Failed to instanciate property in exif:ColorSpace
>>>     28  Failed to instanciate property in xmpMM:History
>>>     26  xmp should start with a processing instruction
>>>     24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
>>>     21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
>>>     14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
>>>     14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
>>>     12  Failed to instanciate property in xmp:MetadataDate
>>>     10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
>>>     10  Failed to instanciate property in xap:ModifyDate
>>>     10  Failed to instanciate property in xmp:ModifyDate
>>>      9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>>>      8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
>>>      8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>      7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
>>>      7  Cannot find a definition for the namespace ptc
>>>      6  Failed to instanciate property in xapMM:History
>>>      5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
>>>      5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
>>>      4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
>>>      4  Excepted xpacket 'end' attribute (must be present and placed in first)
>>>      3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
>>>      3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
>>>      2  no message (NPE)
>>>      2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
>>>      2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
>>>      2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
>>>      2  Failed to instanciate property in xapRights:Marked
>>>      2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
>>>      2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
>>>      2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
>>>      1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
>>>      1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
>>>      1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
>>>      1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
>>>      1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
>>>      1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
>>>      1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
>>>      1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
>>>      1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
>>>      1  Failed to instanciate property in xmpRights:Marked
>>>      1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
>>>      1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
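
For readers of the archive: the "getting the XMP packets out of the files" step mentioned in this thread can be illustrated with a rough sketch. This is not Tika's XMPPacketScanner and not anything from Alfresco; it is a hypothetical, minimal byte scan written in Python for brevity. It assumes the packet is stored uncompressed and in a single-byte-compatible encoding (UTF-8 or ASCII), so it would miss UTF-16 packets, packets inside compressed streams, and the unusual layouts described for InDesign files. The names `PACKET_RE`, `find_xmp_packets`, and `sample` are made up for the example.

```python
import re
from typing import List

# An XMP packet is wrapped in a pair of processing instructions:
#   <?xpacket begin=... ?>  ...serialized XMP...  <?xpacket end=... ?>
# Assumption: the bytes are UTF-8/ASCII and not compressed.
PACKET_RE = re.compile(
    rb"<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>",
    re.DOTALL,
)


def find_xmp_packets(data: bytes) -> List[bytes]:
    """Return the body of every xpacket-wrapped region found in data."""
    return [m.group(1).strip() for m in PACKET_RE.finditer(data)]


# Made-up input: a packet embedded between unrelated bytes.
sample = (
    b"%PDF-1.4 unrelated bytes "
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF/></x:xmpmeta>'
    b'<?xpacket end="w"?> trailing bytes'
)

packets = find_xmp_packets(sample)
```

The extracted bytes would then be handed to an XMP parser such as xmpcore. A scan like this only locates the packet; it does nothing about the strictness issues in the exception table above.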
