[jira] [Commented] (PDFBOX-1812) Illegal characters in XML output

Johan van der Knijff (JIRA) Thu, 16 Jan 2014 06:34:31 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873431#comment-13873431
 ]


Johan van der Knijff commented on PDFBOX-1812:
----------------------------------------------

{quote}
Andreas Lehmkühler  - 20/Dec/13 08:12

The question is: is it a problem that the parser now has some sort of a 
self-healing mechanism? Is a PDF/A1b document with a broken XRef table still a 
valid pdf?
{quote}

I overlooked this comment earlier on.  As I see it, ideally, PDF/A validation 
should be a 2-stage process:

# Validate if the PDF conforms to the general PDF spec (for PDF/A-1b this would 
be PDF 1.4, although ISO 32000 would probably be more practical).
# Check if the PDF conforms to the additional constraints imposed by PDF/A-1b.

This would mean that a PDF can _only_ be valid PDF/A-1b if it passes both 
steps. A broken XRef table would fail step 1, so the file wouldn't be valid 
PDF/A-1b either.
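
To make the two-step idea concrete, here is a minimal sketch in Java. Everything in it is hypothetical and for illustration only: the class name, the {{Verdict}} values and the two check functions are assumptions, not part of Preflight or PDFBox. Each check maps a document to a list of error messages, where an empty list means the step passed.

```java
import java.util.List;
import java.util.function.Function;

public class TwoStageValidation {

    /** Possible outcomes of the two-step process (hypothetical names). */
    public enum Verdict { INVALID_PDF, VALID_PDF_NOT_PDFA, VALID_PDFA }

    /**
     * Runs step 1 (general PDF spec) first, and step 2 (additional
     * PDF/A-1b constraints) only if step 1 reported no errors.
     */
    public static <D> Verdict validate(D doc,
                                       Function<D, List<String>> baseSpecCheck,
                                       Function<D, List<String>> pdfaCheck) {
        // Step 1: a file that is not valid PDF can never be valid PDF/A-1b.
        if (!baseSpecCheck.apply(doc).isEmpty()) {
            return Verdict.INVALID_PDF;
        }
        // Step 2: the extra constraints are checked only for valid PDFs.
        if (!pdfaCheck.apply(doc).isEmpty()) {
            return Verdict.VALID_PDF_NOT_PDFA;
        }
        return Verdict.VALID_PDFA;
    }
}
```

Under this scheme a file with a broken XRef table stops at step 1 and is reported as invalid PDF, regardless of whether a lenient parser could repair it.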

Needless to say, step 1 is a _huge_ task, and I'm not aware of any software 
that is currently capable of it (although some people are lobbying for an 
initiative along these lines; see e.g. 
[http://www.pdfa.org/video/duff-johnson-why-validation/]).

From what I understand, _Preflight_ mostly focuses on step 2 (although I think 
it includes quite a few elements from step 1 as well).

Just my 2 cents ...


> Illegal characters in XML output
> --------------------------------
>
>                 Key: PDFBOX-1812
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1812
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Preflight
>    Affects Versions: 2.0.0
>         Environment: Bug reproduced under Win 7, Ubuntu
>            Reporter: Johan van der Knijff
>            Assignee: Andreas Lehmkühler
>              Labels: characters, utf-8, xml
>             Fix For: 1.8.4, 2.0.0
>
>         Attachments: 013814.pdf, 013814.xml, 013814_old.xml, 
> 1812-additionalPDFs09012014.zip, 598659.pdf, 598659.xml, 598659_old.xml, 
> 600111.pdf, 600111.xml, 600111_old.xml, preflight-app.jar
>
>
> When running Preflight in XML mode, the latest Preflight version (I used the 
> JAR from build #747) sometimes produces output that contains characters that 
> are illegal in XML. This can cause unexpected behavior if such files are 
> further processed with tools that expect well-formed XML.  See attached PDFs, 
> which all result in illegal characters in the description of a 1.0 Syntax 
> error, Error: Expected a long type. Output of older versions of Preflight 
> didn't contain these illegal characters; instead they would give something 
> like *actual='/O'*, *actual='Pages'*, etc. So I suppose this must have been 
> caused by a fairly recent change.
> See attachments below.
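
For reference, the kind of check the issue description implies can be sketched as follows. This is an illustrative assumption, not the fix that was applied in Preflight: the class and method names are hypothetical. The ranges are those of the {{Char}} production in the XML 1.0 specification; anything outside them is replaced with U+FFFD before the text is written out.

```java
public class XmlCharFilter {

    /** True if the code point matches the XML 1.0 Char production. */
    public static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that is illegal in XML 1.0 with U+FFFD. */
    public static String replaceIllegal(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        // Iterate by code point so that supplementary characters stay intact
        // and unpaired surrogates are caught as illegal.
        text.codePoints().forEach(
                cp -> sb.appendCodePoint(isLegalXmlChar(cp) ? cp : 0xFFFD));
        return sb.toString();
    }
}
```

Note that this only deals with characters that cannot appear in XML at all; markup characters such as {{<}} and {{&}} still need the usual escaping.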



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
