On 8 March 2016 at 19:30, Allison, Timothy B. wrote:
The comment I made is just my personal opinion. ... Maybe improve XMPBox as you
suggested (I did have a look but it doesn't seem easy).
Oh, ok, so it isn't necessarily set in stone.
IMHO, no, we are always open for proposals :-)
What do other PDFBox devs think? Is there interest in modifying XmpBox to be
more lenient? Not for 2.0.0, obviously... :)
We should try to improve XmpBox as long as it is reasonable. XmpBox should not
be limited to PDF/A. But we need proper documentation for the missing namespaces.
BR
Andreas
-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Tuesday, March 08, 2016 12:56 PM
To: [email protected]
Subject: Re: roadmap for XMPBox?
On 8 March 2016 at 18:44, Allison, Timothy B. wrote:
Got it. Thank you. I wanted to confirm that nothing had changed since last
summer (PDFBOX-2855).
Are you taking bug reports for jempbox or is that entirely eol'd?
Yes, I recently fixed a bug there.
Any recommendations for a somewhat lenient, Apache license-compatible XMP
parser?
Sorry, don't know.
Might it make sense to include in the README or in the package
javadocs something about the goals for XmpBox? It is entirely
possible that I missed the warning. ;)
The comment I made is just my personal opinion. It's your comment that made me
realize that with XMPBox, we can't parse some files that are not PDF/A
compatible but are correct XMP files. I don't have an idea what to do. Maybe
improve XMPBox as you suggested (I did have a look but it doesn't seem easy).
Maybe resurrect Jempbox, or use the 1.8 version.
Tilman
Thank you, again.
Best,
Tim
-----Original Message-----
From: Tilman Hausherr [mailto:[email protected]]
Sent: Tuesday, March 08, 2016 12:13 PM
To: [email protected]
Subject: Re: roadmap for XMPBox?
I think the problem is that XmpBox was written for PDF/A checking, so it fails
with XMPs that are not PDF/A. For example, file 000142.pdf has the schema
http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
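
To see which XMP packets would run into this, one can list the namespaces a packet declares and compare them against a set of known ones. The following is only a minimal sketch using the JDK's DOM parser; it is not how XmpBox works internally. The class name, the method name, and the contents of KNOWN are assumptions made for illustration, and KNOWN is a small subset, not the full list from the technical note linked above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmpNamespaceCheck {

    // Illustrative subset only (assumption): extend it from the PDF/A technical note as needed.
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
            "adobe:ns:meta/",
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            "http://purl.org/dc/elements/1.1/",
            "http://ns.adobe.com/xap/1.0/",
            "http://ns.adobe.com/pdf/1.3/"));

    /** Returns every namespace URI declared in the packet that is not in KNOWN. */
    static Set<String> unknownNamespaces(String xmp) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

        Set<String> unknown = new TreeSet<>();
        NodeList elements = doc.getElementsByTagName("*");
        for (int i = 0; i < elements.getLength(); i++) {
            NamedNodeMap attrs = elements.item(i).getAttributes();
            for (int j = 0; j < attrs.getLength(); j++) {
                Node attr = attrs.item(j);
                String name = attr.getNodeName();
                // Namespace declarations appear as xmlns or xmlns:prefix attributes.
                boolean isDeclaration = name.equals("xmlns") || name.startsWith("xmlns:");
                if (isDeclaration && !KNOWN.contains(attr.getNodeValue())) {
                    unknown.add(attr.getNodeValue());
                }
            }
        }
        return unknown;
    }

    public static void main(String[] args) throws Exception {
        String xmp = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
                + "<rdf:Description rdf:about=\"\""
                + " xmlns:pdf=\"http://ns.adobe.com/pdf/1.3/\""
                + " xmlns:pdfx=\"http://ns.adobe.com/pdfx/1.3/\">"
                + "<pdf:Producer>Example</pdf:Producer>"
                + "<pdfx:Company>Example</pdfx:Company>"
                + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(unknownNamespaces(xmp));
    }
}
```

In the sample packet only the pdfx namespace falls outside KNOWN, which is the same namespace that file 000142.pdf trips over.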
And no, there are no plans for anything on XMP at this time...
Tilman
On 7 March 2016 at 19:31, Allison, Timothy B. wrote:
All,
When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch
from our current reliance on jempbox to XMPBox. I recently extracted ~70k XMPs
from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there
were exceptions on roughly 40% of the XMPs.
I’m including a table below of the counts of exception messages. Are
there any plans to make XMPBox more lenient or is this what we can expect going
forward?
As always, I’m more than happy to help with files and tests. Let me know
what I can do.
Cheers,
Tim
No XmpParsingException on 42,022 files.
Exceptions:
 Count  Exception message
 13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
  3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
  3113  Missing pdfaSchema:property in type definition
  2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
   927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
   723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
   710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
   654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
   522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
   492  Failed to parse
   370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
   262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
   188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
   144  Failed to instanciate property in xmp:CreateDate
   125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
    94  Expecting local name 'xmpmeta' and found 'xapmeta'
    84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
    74  Failed to instanciate property in xap:CreateDate
    68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
    49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
    46  Cannot find a definition for the namespace http://www.sap.com
    33  Failed to instanciate property in exif:ColorSpace
    28  Failed to instanciate property in xmpMM:History
    26  xmp should start with a processing instruction
    24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
    21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
    14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
    14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
    12  Failed to instanciate property in xmp:MetadataDate
    10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
    10  Failed to instanciate property in xap:ModifyDate
    10  Failed to instanciate property in xmp:ModifyDate
     9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
     8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
     8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
     7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
     7  Cannot find a definition for the namespace ptc
     6  Failed to instanciate property in xapMM:History
     5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
     5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
     4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
     4  Excepted xpacket 'end' attribute (must be present and placed in first)
     3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
     3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
     2  no message (NPE)
     2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
     2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
     2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
     2  Failed to instanciate property in xapRights:Marked
     2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
     2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
     2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
     1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
     1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
     1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
     1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
     1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
     1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
     1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
     1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
     1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
     1  Failed to instanciate property in xmpRights:Marked
     1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
     1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
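
For anyone who wants to reproduce counts like these on their own files, a tally of exception messages can be sketched roughly as follows. This is not the code that produced the numbers above. The PacketParser interface, the class name, and the OK label are hypothetical; PacketParser stands in for whatever parser is under test (for example, a call into XMPBox's DomXmpParser), so the sketch runs without any PDFBox dependency.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class XmpExceptionTally {

    /** Hypothetical stand-in for the parser under test. */
    interface PacketParser {
        void parse(byte[] packet) throws Exception;
    }

    /** Key used for packets that parse without an exception (label is an assumption). */
    static final String OK = "No exception";

    /** Parses every packet and counts the outcomes by exception message. */
    static Map<String, Integer> tally(List<byte[]> packets, PacketParser parser) {
        Map<String, Integer> counts = new HashMap<>();
        for (byte[] packet : packets) {
            String key = OK;
            try {
                parser.parse(packet);
            } catch (Exception e) {
                // Some exceptions (e.g. NullPointerException) may carry no message.
                key = e.getMessage() != null
                        ? e.getMessage()
                        : "no message (" + e.getClass().getSimpleName() + ")";
            }
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Orders the entries by descending count, then by message for stable output. */
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries;
    }

    public static void main(String[] args) {
        // Toy parser (assumption): rejects anything that does not start with an xpacket PI.
        PacketParser toyParser = packet -> {
            String text = new String(packet, StandardCharsets.UTF_8);
            if (!text.startsWith("<?xpacket")) {
                throw new IllegalArgumentException("xmp should start with a processing instruction");
            }
        };
        List<byte[]> packets = Arrays.asList(
                "<?xpacket begin='' id='x'?>".getBytes(StandardCharsets.UTF_8),
                "<x:xmpmeta/>".getBytes(StandardCharsets.UTF_8),
                "<rdf:RDF/>".getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, Integer> entry : sortedByCount(tally(packets, toyParser))) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

Grouping on the raw message keeps the sketch simple; messages that embed per-file values would need normalizing before counting.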
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected] For
additional commands, e-mail: [email protected]