[GitHub] [pdfbox] gunnar-ifp opened a new pull request #123: lenient DomXmpParser

GitBox Wed, 23 Jun 2021 05:14:28 -0700


gunnar-ifp opened a new pull request #123:
URL: https://github.com/apache/pdfbox/pull/123



   The XMP box library is nice, but there are PDF files out in the wild that 
fail to parse. For example, dc.creator is a Bag instead of a Seq.
   
   Ideally the parser would have a mode where it tries to read as many 
properties as possible by simply discarding the unreadable ones. This is not 
good if you want to write the PDF back, but if you just want to extract 
metadata, such a mode would be nice. In that mode the invalid dc.creator value 
would be dropped. Implementing it would require some more work.
   
   I've seen that there is a non-strict parsing mode. I don't think it should 
be confused with the proposed lenient mode, but as the name suggests it should 
be less strict. So in this mode a Seq could be read from a Bag and vice versa. 
I left Alt cardinality as an error because it doesn't really fit in.
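
   To illustrate the Seq/Bag idea, here is a minimal sketch that uses only the 
JDK DOM parser. It is not the code in this pull request and does not use the 
xmpbox API; the class `LenientArrayReader`, its methods and the `strict` flag 
are hypothetical names made up for the example. In non-strict mode a Seq and a 
Bag are treated as interchangeable, while an Alt where a Seq or Bag is expected 
is still an error.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of xmpbox: reads an RDF array property
// and optionally tolerates a Bag where a Seq is expected (and vice versa).
public class LenientArrayReader {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Parses the XML and returns the first element with the given name, or null.
    static Element findProperty(String xml, String ns, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return (Element) doc.getElementsByTagNameNS(ns, localName).item(0);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    static boolean isSeqOrBag(String type) {
        return "Seq".equals(type) || "Bag".equals(type);
    }

    // expected is the container type the schema asks for: "Seq", "Bag" or "Alt".
    static List<String> readArray(Element property, String expected, boolean strict) {
        if (property == null) {
            throw new IllegalStateException("property not found");
        }
        // The array container is the first child element of the property.
        Element container = null;
        for (Node n = property.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                container = (Element) n;
                break;
            }
        }
        if (container == null || !RDF_NS.equals(container.getNamespaceURI())) {
            throw new IllegalStateException("property has no RDF array container");
        }
        String actual = container.getLocalName();
        boolean exactMatch = expected.equals(actual);
        // Only Seq and Bag may stand in for each other; Alt never does.
        boolean seqBagSwap = isSeqOrBag(expected) && isSeqOrBag(actual);
        if (!exactMatch && (strict || !seqBagSwap)) {
            throw new IllegalStateException(
                    "expected rdf:" + expected + " but found rdf:" + actual);
        }
        List<String> values = new ArrayList<>();
        for (Node n = container.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && RDF_NS.equals(n.getNamespaceURI())
                    && "li".equals(n.getLocalName())) {
                values.add(n.getTextContent().trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // dc:creator should be a Seq, but this file stores it as a Bag.
        String xml = "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\" xmlns:dc=\"" + DC_NS + "\">"
                + "<rdf:Description rdf:about=\"\"><dc:creator><rdf:Bag>"
                + "<rdf:li>Alice</rdf:li><rdf:li>Bob</rdf:li>"
                + "</rdf:Bag></dc:creator></rdf:Description></rdf:RDF>";
        Element creator = findProperty(xml, DC_NS, "creator");
        System.out.println(readArray(creator, "Seq", false)); // prints [Alice, Bob]
    }
}
```

   With `strict` set to true the same input is rejected with an exception, 
which mirrors the failure described at the top of this message.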
   
   Maybe in one of the modes an element that should be an array but isn't could 
automagically be wrapped into one...
   
   (I also believe that a Bag could always be read from a Sequence...)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
