Hi John, thanks for bringing this up. This is a very important topic which was also discussed at the PDFDays in Germany.
# Tests #
In addition to rendering, we should also cover metadata and text extraction as well as PDF/A validation.

# Testfiles #
A number of test sets have recently been made available which we can use:

http://digitalcorpora.org/corpora/files
https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors
…

For PDF/A validation there is the Isartor test suite:
http://www.pdfa.org/2011/08/download-isartor-test-suite/
Some restrictions apply there.

In addition, we can put additional files into our own repository, as you suggested. So there is no shortage of test files.

TIKA-1300/TIKA-1302 has a discussion around the same topic, together with some development of an infrastructure (VM, Jenkins …). IMHO we should join forces with them.

BR
Maruan

On 04.07.2014 at 02:16, John Hewson <j...@jahewson.com> wrote:

> Hi All
>
> I’ve been thinking about regression testing recently and how we can
> improve our tests for rendering. There are currently two problems:
>
> 1) Different JDKs produce slightly different renderings (see
> PDFBOX-1843). (I suspect that AWT fonts are a big part of this, so the
> problem might get a lot better soon once we render all fonts ourselves).
>
> 2) Most PDF test files we have are not under an Apache-friendly license,
> so we can’t put the test files into the trunk SVN.
>
> It seems that some of you have your own collections of test PDF files
> which you are running regression tests on: that’s great but it would be
> much better if we had a central repository of test files and sample
> renderings.
>
> I’d like to suggest the following solutions to the above issues:
>
> 1) We should choose a “blessed” JDK which will be used to perform the
> renderings; this should be whatever is a convenient and sensible default
> for committers. (My preference would be for Oracle’s JDK 7 because JDK 6
> is deprecated and has known rendering bugs). We should make sure that
> Jenkins runs tests using the “blessed” JDK.
>
> The regression test can then check to see if it is running on the
> “blessed” JDK and if not then the tests can be skipped and we can warn
> the user.
>
> 2) We should create a new “regression” branch in SVN which contains only
> PDF files for testing and PNG images which contain known-good renderings
> created using the “blessed” JDK. This branch would not be part of the
> source of PDFBox but will still allow us to version control the test
> PDFs (it also simplifies the workflow for adding new test PDFs and new
> known-good renderings: simply do an “svn add”).
>
> As far as copyright and licensing is concerned we can put any PDF files
> which are available publicly on the web into this branch without too
> much worry.
>
> What does everybody think?
>
> -- John
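
To make the proposal in the quoted mail a bit more concrete, here is a rough sketch of the two checks it describes: detecting whether the tests run on the "blessed" JDK, and comparing a fresh rendering against a known-good one. It is only an illustration and not existing PDFBox code. The class and method names are hypothetical, Oracle JDK 7 is assumed as the "blessed" JDK purely because John named it as his preference, and the exact pixel-by-pixel comparison is an assumption as well (a tolerance might turn out to be needed).

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox: sketches the "blessed JDK" check
// and a strict comparison against a known-good rendering.
class RenderingRegressionSketch {

    // Assumption: the "blessed" JDK is Oracle JDK 7 (a preference, not a decision).
    static boolean isBlessedJdk(String vendor, String version) {
        return vendor != null && version != null
                && vendor.startsWith("Oracle")
                && version.startsWith("1.7.");
    }

    // Reads the standard system properties of the running JVM.
    static boolean runningOnBlessedJdk() {
        return isBlessedJdk(System.getProperty("java.vendor"),
                            System.getProperty("java.version"));
    }

    // Returns the number of pixels that differ between the two images,
    // or -1 if the images do not even have the same dimensions.
    static int countDifferingPixels(BufferedImage expected, BufferedImage actual) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            return -1;
        }
        int differing = 0;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                if (expected.getRGB(x, y) != actual.getRGB(x, y)) {
                    differing++;
                }
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        if (!runningOnBlessedJdk()) {
            // Skip instead of failing, and warn the user, as proposed.
            System.out.println("WARNING: not running on the blessed JDK, "
                    + "skipping rendering regression tests");
            return;
        }
        // Stand-ins for a known-good PNG and a fresh rendering.
        BufferedImage expected = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        BufferedImage actual = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        System.out.println("differing pixels: " + countDifferingPixels(expected, actual));
    }
}
```

In a real test the two images would come from the known-good PNG in the "regression" branch and from rendering the corresponding test PDF; the skip could also be expressed with the test framework's own mechanism instead of a plain check.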