Hi John,

Thanks for bringing this up. This is a very important topic which was also
discussed at the PDFDays in Germany.

# Tests #
In addition to rendering, we should also cover metadata and text extraction as
well as PDF/A validation.

# Testfiles # 
A number of test sets have recently been made available which we can use:
http://digitalcorpora.org/corpora/files ,
https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors …
For PDF/A validation there is the Isartor test suite
http://www.pdfa.org/2011/08/download-isartor-test-suite/ (some restrictions
apply there).
We can also put additional files into our own repository, as you suggested.
So there is no shortage of test files.

TIKA-1300/TIKA-1302 discuss the same topic and include some development of
test infrastructure (VM, Jenkins …). IMHO we should join forces with them.

BR

Maruan


On 04.07.2014 at 02:16, John Hewson <j...@jahewson.com> wrote:

> Hi All
> 
> I’ve been thinking about regression testing recently and how we can improve
> our tests for rendering. There are currently two problems:
> 
> 1) Different JDKs produce slightly different renderings (see PDFBOX-1843).
>    (I suspect that AWT fonts are a big part of this, so the problem might get
>    a lot better soon once we render all fonts ourselves).
> 
> 2) Most PDF test files we have are not under an Apache-friendly license, so
>    we can’t put the test files into the trunk SVN.
> 
> It seems that some of you have your own collections of test PDF files which
> you are running regression tests on: that’s great but it would be much
> better if we had a central repository of test files and sample renderings.
> 
> I’d like to suggest the following solutions to the above issues:
> 
> 1) We should choose a “blessed” JDK which will be used to perform the
>    renderings; this should be whatever is a convenient and sensible default
>    for committers. (My preference would be for Oracle’s JDK 7 because JDK 6
>    is deprecated and has known rendering bugs). We should make sure that
>    Jenkins runs tests using the “blessed” JDK.
> 
>    The regression test can then check to see if it is running on the
>    “blessed” JDK and if not then the tests can be skipped and we can warn
>    the user.
> 
> 2) We should create a new “regression” branch in SVN which contains only PDF
>    files for testing and PNG images which contain known-good renderings
>    created using the “blessed” JDK. This branch would not be part of the
>    source of PDFBox but will still allow us to version control the test PDFs
>    (it also simplifies the workflow for adding new test PDFs and new
>    known-good renderings: simply do an “svn add”).
> 
>    As far as copyright and licensing is concerned we can put any PDF files
>    which are available publicly on the web into this branch without too much
>    worry.
> 
> What does everybody think?
> 
> -- John
> 
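
The JDK check from John's first suggestion could look roughly like the
following Java sketch. The class name BlessedJdkCheck, its method names and
the expected property values are assumptions made for illustration; none of
this is existing PDFBox code. It compares the standard system properties
java.vendor and java.specification.version against the values an Oracle JDK 7
is expected to report, and warns instead of running the tests otherwise.

```java
public class BlessedJdkCheck {
    // Assumed values for the "blessed" JDK (Oracle JDK 7). Some OpenJDK builds
    // may report the same vendor string, so a real check might also need to
    // look at a property such as java.vm.name.
    public static final String BLESSED_VENDOR = "Oracle Corporation";
    public static final String BLESSED_SPEC_VERSION = "1.7";

    // Pure comparison, kept separate so it can be tested on any JDK.
    public static boolean isBlessed(String vendor, String specVersion) {
        return BLESSED_VENDOR.equals(vendor)
                && BLESSED_SPEC_VERSION.equals(specVersion);
    }

    // Reads the properties of the JVM that is currently running.
    public static boolean runningOnBlessedJdk() {
        return isBlessed(System.getProperty("java.vendor"),
                         System.getProperty("java.specification.version"));
    }

    public static void main(String[] args) {
        if (!runningOnBlessedJdk()) {
            // Skip the tests and warn the user, as proposed in the quoted mail.
            System.err.println("WARNING: not running on the blessed JDK ("
                    + System.getProperty("java.vendor") + " "
                    + System.getProperty("java.specification.version")
                    + "), skipping rendering regression tests");
            return;
        }
        System.out.println("Blessed JDK detected, rendering tests may run");
    }
}
```

In a JUnit 4 test the same condition could be passed to Assume.assumeTrue, so
that the test is reported as skipped rather than failed on other JDKs.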
