To clarify... the 'we' in my third sentence was referring to Alfresco, not Tika.

I'm not sure how much of that code would be useful but I may be able to 
contribute some of it.

Regards,

Ray


> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B. <[email protected]> wrote:
> 
> Thank you.  Will take a look.
> 
> -----Original Message-----
> From: Ray Gauss [mailto:[email protected]] 
> Sent: Tuesday, March 08, 2016 1:55 PM
> To: [email protected]
> Subject: Re: [DISCUSS] options for XMP parsing?
> 
> Hi Tim,
> 
> We're already using Adobe's xmpcore in tika-xmp, which works fine for parsing 
> XMP (though it has not seen updates in a while), but getting the XMP packets 
> out of the files is trickier.
> 
> We have XMPPacketScanner which works for many cases, but not all.  InDesign 
> files for example do some strange things.
> 
> In the past we've used different packet scanners depending on the file type 
> (including the Exiftool command line) to get the XMP out, then used xmpcore 
> to parse it into simple flattened properties.
> 
> Regards,
> 
> Ray
> 
> 
>> On Mar 8, 2016, at 12:50 PM, Allison, Timothy B. <[email protected]> wrote:
>> 
>> All,
>> 
>> PDFBox 2.0 is soon to be released.  In the course of its development, the 
>> project has migrated from Jempbox (which we're now using) to XmpBox; and 
>> Jempbox is now on its last legs.  
>> 
>> XmpBox was "written for PDF/A checking," not for robust processing of common 
>> variants of XMPs in the wild; I found that it fails on roughly 40% of XMPs I 
>> pulled out of PDFs from govdocs1/commoncrawl.
>> 
>> In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
>> 
>> Has anyone had any luck with an Apache-friendly XMP parser?  Are there 
>> better options than copying and pasting jempbox into Tika and maintaining it 
>> ourselves (yuk!)?
>> 
>>         Best,
>> 
>>                Tim
>> 
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:[email protected]]
>> Sent: Tuesday, March 08, 2016 12:13 PM
>> To: [email protected]
>> Subject: Re: roadmap for XMPBox?
>> 
>> I think the problem is that XmpBox was written for PDF/A checking, so it 
>> fails with XMPs that are not PDF/A. For example, file 000142.pdf has the 
>> schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>> 
>> And no, there are no plans for anything on XMP at this time...
>> 
>> Tilman
>> 
>> 
>> Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.:
>>> All,
>>> 
>>> When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch 
>>> from our current reliance on jempbox to XMPBox.  I recently extracted ~70k 
>>> XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, 
>>> there were exceptions on roughly 40% of the XMPs.
>>> 
>>> I’m including a table below of the counts of exception messages.  Are 
>>> there any plans to make XMPBox more lenient or is this what we can expect 
>>> going forward?
>>> 
>>> As always, I’m more than happy to help with files and tests.  Let me know 
>>> what I can do.
>>> 
>>>             Cheers,
>>> 
>>>                      Tim
>>> 
>>> No XmpParsingException on 42,022 files.
>>> 
>>> Exceptions:
>>> 
>>>  Count  Message
>>>  -----  -------
>>>  13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
>>>   3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>   3113  Missing pdfaSchema:property in type definition
>>>   2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>>>    927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>>>    723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
>>>    710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
>>>    654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
>>>    522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>>>    492  Failed to parse
>>>    370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
>>>    262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
>>>    188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
>>>    144  Failed to instanciate property in xmp:CreateDate
>>>    125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
>>>     94  Expecting local name 'xmpmeta' and found 'xapmeta'
>>>     84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
>>>     74  Failed to instanciate property in xap:CreateDate
>>>     68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
>>>     49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
>>>     46  Cannot find a definition for the namespace http://www.sap.com
>>>     33  Failed to instanciate property in exif:ColorSpace
>>>     28  Failed to instanciate property in xmpMM:History
>>>     26  xmp should start with a processing instruction
>>>     24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
>>>     21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
>>>     14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
>>>     14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
>>>     12  Failed to instanciate property in xmp:MetadataDate
>>>     10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
>>>     10  Failed to instanciate property in xap:ModifyDate
>>>     10  Failed to instanciate property in xmp:ModifyDate
>>>      9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>>>      8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
>>>      8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>>      7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
>>>      7  Cannot find a definition for the namespace ptc
>>>      6  Failed to instanciate property in xapMM:History
>>>      5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
>>>      5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
>>>      4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
>>>      4  Excepted xpacket 'end' attribute (must be present and placed in first)
>>>      3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
>>>      3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
>>>      2  no message (NPE)
>>>      2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
>>>      2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
>>>      2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
>>>      2  Failed to instanciate property in xapRights:Marked
>>>      2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
>>>      2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
>>>      2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
>>>      1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
>>>      1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
>>>      1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
>>>      1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
>>>      1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
>>>      1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
>>>      1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
>>>      1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
>>>      1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
>>>      1  Failed to instanciate property in xmpRights:Marked
>>>      1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
>>>      1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
>>> 
>> 
>> 
>> 
> 
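
The packet-extraction step mentioned in the quoted thread (getting the XMP packet out of a file before handing it to a parser such as xmpcore) might look roughly like the sketch below. This is not Tika's XMPPacketScanner or any Alfresco code. The class name NaiveXmpPacketFinder and its methods are hypothetical, and the sketch assumes a single packet stored as 8-bit text (UTF-8) that is wrapped in the usual "<?xpacket begin=" and "<?xpacket end=" processing instructions. Files that store the packet in another encoding, omit the wrapper, or split the packet would need a different approach, which is presumably why a simple scanner does not work in all cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Tika, xmpcore or PDFBox.
final class NaiveXmpPacketFinder {

    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PI_CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    /** Returns the index of the first occurrence of needle at or after from, or -1. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns the first packet found, from the start of the "begin" processing
     * instruction through the "?>" that closes the "end" processing instruction.
     * Assumption: the packet bytes are UTF-8.
     */
    static Optional<String> firstPacket(byte[] data) {
        int start = indexOf(data, BEGIN, 0);
        if (start < 0) {
            return Optional.empty();
        }
        int endPi = indexOf(data, END, start + BEGIN.length);
        if (endPi < 0) {
            return Optional.empty();
        }
        int close = indexOf(data, PI_CLOSE, endPi + END.length);
        if (close < 0) {
            return Optional.empty();
        }
        int stop = close + PI_CLOSE.length;
        return Optional.of(new String(data, start, stop - start, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String file = "binary junk"
                + "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
                + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"/>"
                + "<?xpacket end=\"w\"?>"
                + "more junk";
        Optional<String> packet = firstPacket(file.getBytes(StandardCharsets.UTF_8));
        System.out.println(packet.orElse("(no packet found)"));
    }
}
```

The string returned by firstPacket could then be passed to whichever XMP parser is in use. Scanning raw bytes rather than decoding the whole file first keeps the sketch usable on binary containers, at the cost of ignoring per-format details.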
