Re: Paid PDFBox support

John Hewson Tue, 08 Jul 2014 18:25:26 -0700

Hi Leonard

I’ve found one... it took some searching, as I don’t have an automated way to
extract them from my dataset. Of the PDFs I have, 99% have no syntax
errors; their problems instead relate to bugs in PDFBox.

The attached PDF is really quite broken: it can’t be viewed correctly in
Acrobat, although it can be opened. OS X Preview renders the file
perfectly, so presumably that’s how it was being used.

At a glance it appears to be missing its xref table and has strings in the
MediaBox instead of numbers.
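
As a rough illustration of the second problem, the following Python sketch
scans raw PDF bytes for /MediaBox arrays and reports any entries that are not
plain numbers. The function name and the regular expressions are assumptions
made for this example; they are not part of PDFBox or Preflight, and a
byte-level scan like this will miss boxes stored in compressed object streams.

```python
import re

# Matches a PDF literal string, a hex string, or any other bare token.
TOKEN = re.compile(rb"\((?:[^()\\]|\\.)*\)|<[0-9A-Fa-f\s]*>|[^\s()<>\[\]]+")
# PDF numbers: optional sign, digits with an optional decimal point.
NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")


def non_numeric_mediabox_entries(pdf_bytes):
    """Return the MediaBox tokens that are not plain numbers (hypothetical helper)."""
    bad = []
    # Naive scan: assumes the array is stored uncompressed and contains no ']'
    # inside a string.
    for match in re.finditer(rb"/MediaBox\s*\[(.*?)\]", pdf_bytes, re.DOTALL):
        for token in TOKEN.findall(match.group(1)):
            if not NUMBER.fullmatch(token):
                bad.append(token)
    return bad
```

For a box written as `[(0) (0) (612) (792)]` the function returns the four
string tokens, while a well-formed `[0 0 612 792]` yields an empty list.
Indirect references inside the array would also be reported, which may or may
not be what a real checker wants.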

I often encounter files which pass the Preflight syntax check but have
problems which we thought were syntax errors, such as extra entries
in arrays - would you be interested in such files? I can collect a few.

PS. I’ve CC'd this e-mail to you, as the dev list will most likely remove the
attachment.

damaged.pdf
Description: Adobe PDF document

Thanks

-- John

On 8 Jul 2014, at 13:40, Leonard Rosenthol <lrose...@adobe.com> wrote:

Hmmm…If you have one of those, I’d love to see it!

Thanks,
Leonard

On 7/8/14, 4:30 PM, "John Hewson" <j...@jahewson.com> wrote:

That’s good to know. I guess that the last file I tried it on contained
low-level syntax errors in the same place it had some key/value errors so
I conflated the two.

My main problem has been that most of the time when a file does contain
an error, rather than getting a description of the error, the Preflight
process fails with only the message:

"An error occurred while parsing a contents stream.
Unable to analyse the PDF file."

-- John

On 8 Jul 2014, at 13:04, Leonard Rosenthol <lrose...@adobe.com> wrote:

Actually, John, it won’t report on either of those things you’ve
mentioned. :)

What it does, however, is check every key & every value in every
dictionary, and each element of every array in the Body of the PDF to
make sure that their name, value, type, presence (or not) matches what it
says in the spec. It also does the same for all content streams (whether
page, XObject, AP, etc.). It also attempts to load every Font and ICC
Profile that is referenced for at least basic validity.

Leonard

On 7/8/14, 3:50 PM, "John Hewson" <j...@jahewson.com> wrote:

That’s only going to find the most basic syntax errors though, such as a
dictionary with a missing >> or an object which ends without “endobj”.
It doesn’t check that the structure of the PDF is valid.

I run the preflight syntax error check on almost every problematic PDF
which we get on JIRA and I’ve had it report an issue maybe twice.

-- John

On 8 Jul 2014, at 12:35, Leonard Rosenthol <lrose...@adobe.com> wrote:

Actually, preflight has an option called “Report PDF Syntax Errors” which
WILL check against ISO 32000-1 compliance - at least for the PDF body
objects themselves.

Leonard

On 7/8/14, 2:02 PM, "John Hewson" <j...@jahewson.com> wrote:

On 8 Jul 2014, at 10:53, Martin Schröder <mar...@oneiros.de> wrote:

2014-07-08 19:49 GMT+02:00 John Hewson <j...@jahewson.com>:
In Adobe Acrobat this file has only two pages, so as noted the root of
the page tree is invalid:

/Kids [3 0 R, 3 0 R, 3 0 R]

This is IMHO perfectly valid.

In cases like this where the spec is vague we rely on Acrobat’s behaviour
to decide what is and isn’t valid.
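
Whether or not a repeated reference such as the one in the /Kids array quoted
above counts as valid, it can at least be flagged mechanically. The Python
sketch below is a minimal illustration under stated assumptions: the function
name is hypothetical, it scans raw bytes rather than parsing the file, and it
will not see page trees stored in compressed object streams.

```python
import re
from collections import Counter

# An indirect reference: object number, generation number, then the keyword R.
REF = re.compile(rb"(\d+)\s+(\d+)\s+R")


def duplicate_kids(pdf_bytes):
    """Return (object, generation) pairs listed more than once in a /Kids array."""
    dupes = []
    for match in re.finditer(rb"/Kids\s*\[(.*?)\]", pdf_bytes, re.DOTALL):
        refs = [(int(num), int(gen)) for num, gen in REF.findall(match.group(1))]
        counts = Counter(refs)
        dupes.extend(ref for ref, count in counts.items() if count > 1)
    return dupes
```

Applied to the array quoted above, the function reports object 3, generation 0
as repeated; an array whose references are all distinct produces an empty list.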

Has anybody tried preflighting the pdf with Acrobat?

Preflight can do some basic checks on “standard” PDFs but it’s really
limited; it’s mostly for PDF/A, because the “standard” PDF spec is too
vague to be used to verify conformance (many usable PDFs are
non-conformant anyway).

-- John
