Of course I agree with the need for regression tests; however, it isn't
easy. Besides the problem of the different JDKs (I use JDK7 on 64-bit
Windows), there is the problem that some enhancements create slight changes
in rendering that are not errors, i.e. both the "before" and the "after"
files look OK by themselves. This happened when we changed the text
rendering recently, and happened again when the clipping was
improved. The cause is probably slight changes in color or in boundaries.
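One way to cope with such harmless differences is to compare renderings with a tolerance instead of byte for byte. The following is only a rough sketch of that idea, not the code of TestPDFToImage; the class name TolerantImageDiff, its method names and the threshold values are assumptions made for illustration.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper for illustration; not part of PDFBox. */
public final class TolerantImageDiff {

    private TolerantImageDiff() {
    }

    /**
     * Returns the fraction (0.0 to 1.0) of pixels whose red, green or blue
     * channel differs by more than channelTolerance between the two images.
     * Images of different size are treated as completely different.
     */
    public static double differingFraction(BufferedImage expected,
                                           BufferedImage actual,
                                           int channelTolerance) {
        int width = expected.getWidth();
        int height = expected.getHeight();
        if (width != actual.getWidth() || height != actual.getHeight()) {
            return 1.0;
        }
        long differing = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int a = expected.getRGB(x, y);
                int b = actual.getRGB(x, y);
                if (channelDiff(a, b, 16) > channelTolerance      // red
                        || channelDiff(a, b, 8) > channelTolerance  // green
                        || channelDiff(a, b, 0) > channelTolerance) { // blue
                    differing++;
                }
            }
        }
        return (double) differing / ((long) width * height);
    }

    /** Absolute difference of one 8-bit channel of two packed ARGB values. */
    private static int channelDiff(int argb1, int argb2, int shift) {
        return Math.abs(((argb1 >> shift) & 0xFF) - ((argb2 >> shift) & 0xFF));
    }

    /** True if at most maxDifferingFraction of the pixels differ noticeably. */
    public static boolean matches(BufferedImage expected,
                                  BufferedImage actual,
                                  int channelTolerance,
                                  double maxDifferingFraction) {
        return differingFraction(expected, actual, channelTolerance)
                <= maxDifferingFraction;
    }
}
```

A small per-channel tolerance would absorb slight color changes, and a small allowed fraction would absorb shifted boundaries. Suitable values would have to be found by experiment, and a real regression may still hide below any threshold that is chosen.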
Copyright is a problem: I'm testing mostly with JIRA attachments that
I've downloaded over the years. While uploading such files to JIRA might
count as fair use, I doubt that this would still be true if they were
included in a distribution. Instead, they should be stored somewhere on
Apache servers where only committers and build software ("Travis",
"Jenkins", ...) can access them. The public PDFs that Maruan mentions
probably don't have all the problem cases that we solved before. However,
I have started working with these files and there are at least 5 recent
issues that deal with them.
I'm using an improved version of the TestPDFToImage class and I will
commit it within a few days, but I must clean it up first.
Re preflight: the default mode should be to have the Isartor tests on.
Individuals could still disable them locally, but the central build
software should always use them.
Tilman
On 04.07.2014 08:43, Maruan Sahyoun wrote:
Hi John,
thanks for bringing this up. This is a very important topic which was also
discussed at the PDFDays in Germany.
# Tests #
In addition to rendering we should cover metadata and text extraction as
well as PDF/A validation.
# Testfiles #
Recently there were a number of test sets made available which we can use.
http://digitalcorpora.org/corpora/files ,
https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
For PDF/A validation there is the Isartor test suite
http://www.pdfa.org/2011/08/download-isartor-test-suite/. Some restrictions
apply there.
In addition we can put additional files into our own repository as you
suggested.
So there is no shortage of test files.
TIKA-1300/TIKA-1302 has a discussion around the same topic together with some
development for an infrastructure (VM, Jenkins …). IMHO we should join forces
with them.
BR
Maruan
On 04.07.2014 at 02:16, John Hewson <j...@jahewson.com> wrote:
Hi All
I’ve been thinking about regression testing recently and how we can improve
our tests for rendering. There are currently two problems:
1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
(I suspect that AWT fonts are a big part of this, so the problem might get
a lot better soon once we render all fonts ourselves).
2) Most PDF test files we have are not under an Apache-friendly license, so
we can’t put the test files into the trunk SVN.
It seems that some of you have your own collections of test PDF files which you
are running regression tests on: that’s great but it would be much better if we
had a central repository of test files and sample renderings.
I’d like to suggest the following solutions to the above issues:
1) We should choose a “blessed” JDK which will be used to perform the renderings;
this should be whatever is a convenient and sensible default for committers. (My
preference would be for Oracle’s JDK 7 because JDK 6 is deprecated and has known
rendering bugs). We should make sure that Jenkins runs tests using the “blessed”
JDK.
The regression test can then check to see if it is running on the “blessed” JDK
and if not then the tests can be skipped and we can warn the user.
2) We should create a new “regression” branch in SVN which contains only PDF
files for testing and PNG images which contain known-good renderings created
using the “blessed” JDK. This branch would not be part of the source of PDFBox
but will still allow us to version control the test PDFs (it also simplifies the
workflow for adding new test PDFs and new known-good renderings: simply do an
“svn add”).
As far as copyright and licensing are concerned we can put any PDF files which
are available publicly on the web into this branch without too much worry.
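The check suggested in point 1 (skip the rendering tests and warn when not on the “blessed” JDK) could be sketched roughly as below. The class name BlessedJdk and the vendor and version strings are assumptions for illustration only; the exact values would have to be agreed on once a JDK is chosen.

```java
/** Hypothetical helper for illustration; not part of PDFBox. */
public final class BlessedJdk {

    // Assumed values: what an Oracle JDK 7 is expected to report in the
    // "java.vendor" and "java.version" system properties.
    public static final String BLESSED_VENDOR = "Oracle Corporation";
    public static final String BLESSED_VERSION_PREFIX = "1.7.";

    private BlessedJdk() {
    }

    /** True if the given vendor and version strings identify the blessed JDK. */
    public static boolean isBlessed(String vendor, String version) {
        return BLESSED_VENDOR.equals(vendor)
                && version != null
                && version.startsWith(BLESSED_VERSION_PREFIX);
    }

    /** True if the JVM running this code is the blessed JDK. */
    public static boolean runningOnBlessedJdk() {
        return isBlessed(System.getProperty("java.vendor"),
                         System.getProperty("java.version"));
    }

    public static void main(String[] args) {
        if (!runningOnBlessedJdk()) {
            // Warn and skip instead of failing, as proposed above.
            System.err.println("WARNING: rendering regression tests skipped, JDK is "
                    + System.getProperty("java.vendor") + " "
                    + System.getProperty("java.version"));
            return;
        }
        System.out.println("Blessed JDK detected, rendering tests can run.");
    }
}
```

In a JUnit 4 test the same condition could probably be passed to org.junit.Assume.assumeTrue, so that the test is reported as skipped rather than failed.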
What does everybody think?
-- John