Of course I agree with the need for regression tests. However, it isn't easy: besides the problem of the different JDKs (I use JDK7 on Windows 64 bit), some enhancements create slight changes in rendering that are not errors, i.e. both the "before" and the "after" files look OK by themselves. This happened when we changed the text rendering recently, and again when the clipping was improved. The cause is probably slight changes in color or in boundaries.
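
One way to keep such harmless differences from failing a regression run is to compare renderings with a tolerance instead of pixel by pixel. The following is only a minimal sketch of that idea, not the code in TestPDFToImage: the class name TolerantImageDiff, its methods and the threshold values are hypothetical, and it uses nothing beyond java.awt.image.BufferedImage.

```java
import java.awt.image.BufferedImage;

/**
 * Hypothetical helper (not part of PDFBox): compares two renderings while
 * tolerating small per-channel color differences.
 */
public final class TolerantImageDiff {

    private TolerantImageDiff() {
    }

    /** True if any of the A, R, G, B channels differ by more than the tolerance. */
    public static boolean channelsDiffer(int argb1, int argb2, int channelTolerance) {
        for (int shift = 0; shift <= 24; shift += 8) {
            int c1 = (argb1 >>> shift) & 0xFF;
            int c2 = (argb2 >>> shift) & 0xFF;
            if (Math.abs(c1 - c2) > channelTolerance) {
                return true;
            }
        }
        return false;
    }

    /** Fraction (0.0 to 1.0) of pixels that differ by more than the tolerance. */
    public static double differingFraction(BufferedImage expected,
                                           BufferedImage actual,
                                           int channelTolerance) {
        int w = expected.getWidth();
        int h = expected.getHeight();
        if (w != actual.getWidth() || h != actual.getHeight()) {
            return 1.0; // different dimensions: treat as completely different
        }
        long differing = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (channelsDiffer(expected.getRGB(x, y), actual.getRGB(x, y),
                        channelTolerance)) {
                    differing++;
                }
            }
        }
        return (double) differing / ((long) w * h);
    }

    /** True if at most maxDifferingFraction of the pixels exceed the tolerance. */
    public static boolean matches(BufferedImage expected, BufferedImage actual,
                                  int channelTolerance, double maxDifferingFraction) {
        return differingFraction(expected, actual, channelTolerance)
                <= maxDifferingFraction;
    }
}
```

A call such as TolerantImageDiff.matches(expected, actual, 2, 0.001) would accept small color shifts and a few changed boundary pixels. Suitable thresholds are an open question and would have to be tuned against real files.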

Copyright is a problem: I'm testing mostly with JIRA attachments that I've downloaded over the years. While uploading such files to JIRA might count as fair use, I doubt that this would still be true if they are included in a distribution. Instead, they should be stored somewhere on Apache servers where only committers and build software ("Travis", "Jenkins", ...) can access them. The public PDFs that Maruan mentions possibly don't have all the problem cases that we solved before. However, I have started working with these files and there are at least 5 recent issues that deal with them.

I'm using an improved version of the TestPDFToImage class and I will commit it within a few days, but I must clean it up first.

Re preflight: the default mode should be to have the Isartor tests on. Individuals could still disable them locally, but the central build software should always use them.

Tilman


On 04.07.2014 08:43, Maruan Sahyoun wrote:
Hi John,

thanks for bringing this up. This is a very important topic which was also
discussed at the PDFDays in Germany.

# Tests #
In addition to rendering we should cover metadata and text extraction as
well as PDF/A validation.

# Testfiles #
Recently there were a number of test sets made available which we can use. 
http://digitalcorpora.org/corpora/files , 
https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
For PDF/A validation there is the Isartor test suite 
http://www.pdfa.org/2011/08/download-isartor-test-suite/. Some restrictions 
apply there.
In addition we can put additional files into our own repository as you 
suggested.
So there is no shortage of test files.

TIKA-1300/TIKA-1302 has a discussion around the same topic together with some 
development for an infrastructure (VM, Jenkins …). IMHO we should join forces 
with them.

BR

Maruan


On 04.07.2014 at 02:16, John Hewson <j...@jahewson.com> wrote:

Hi All

I’ve been thinking about regression testing recently and how we can improve
our tests for rendering. There are currently two problems:

1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
    (I suspect that AWT fonts are a big part of this, so the problem might get
    a lot better soon once we render all fonts ourselves).

2) Most PDF test files we have are not under an Apache-friendly license, so
    we can’t put the test files into the trunk SVN.

It seems that some of you have your own collections of test PDF files which you
are running regression tests on: that’s great but it would be much better if we
had a central repository of test files and sample renderings.

I’d like to suggest the following solutions to the above issues:

1) We should choose a “blessed” JDK which will be used to perform the
    renderings. This should be whatever is a convenient and sensible default
    for committers. (My preference would be for Oracle’s JDK 7 because JDK 6 is
    deprecated and has known rendering bugs). We should make sure that Jenkins
    runs tests using the “blessed” JDK.

    The regression test can then check whether it is running on the “blessed”
    JDK; if not, the tests can be skipped and we can warn the user.

2) We should create a new “regression” branch in SVN which contains only PDF
    files for testing and PNG images which contain known-good renderings
    created using the “blessed” JDK. This branch would not be part of the
    source of PDFBox but will still allow us to version control the test PDFs
    (it also simplifies the workflow for adding new test PDFs and new
    known-good renderings: simply do an “svn add”).

    As far as copyright and licensing are concerned we can put any PDF files
    which are publicly available on the web into this branch without too much
    worry.
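
The JDK check from point 1 could be sketched roughly as follows. This is an illustration only: the class name BlessedJdk and the constants are hypothetical, and the expected values assume Oracle's JDK 7, which reports "Oracle Corporation" for java.vendor and a java.version starting with "1.7.". These two properties alone may not tell Oracle's builds apart from other builds that report the same vendor string.

```java
/**
 * Hypothetical sketch: decides whether the rendering regression tests
 * should run on the current JVM.
 */
public final class BlessedJdk {

    // Assumed values for Oracle's JDK 7; adjust if another JDK is chosen.
    public static final String BLESSED_VENDOR = "Oracle Corporation";
    public static final String BLESSED_VERSION_PREFIX = "1.7.";

    private BlessedJdk() {
    }

    /** True if the given vendor and version strings identify the blessed JDK. */
    public static boolean isBlessed(String vendor, String version) {
        return vendor != null
                && version != null
                && vendor.equals(BLESSED_VENDOR)
                && version.startsWith(BLESSED_VERSION_PREFIX);
    }

    /** Checks the JVM that is currently running. */
    public static boolean isCurrentJvmBlessed() {
        return isBlessed(System.getProperty("java.vendor"),
                System.getProperty("java.version"));
    }

    public static void main(String[] args) {
        if (!isCurrentJvmBlessed()) {
            // Skip instead of failing, and tell the user why.
            System.err.println("WARNING: rendering regression tests skipped, "
                    + "running on " + System.getProperty("java.vendor") + " "
                    + System.getProperty("java.version"));
            return;
        }
        System.out.println("Running rendering regression tests ...");
    }
}
```

In a JUnit 4 test the same condition could be passed to org.junit.Assume.assumeTrue so that the test is reported as skipped rather than failed.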

What does everybody think?

-- John


