Reports are here: https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
Looks like extraction improved slightly. I found a bug at the Tika level that is creating a few more exceptions (will fix soon), but this is not a problem for PDFBox. I was able to re-enable our unit test that counts characters and non-Unicode-mapped characters.

I'll take a closer look tomorrow, but this looks good to me. Again, many thanks to Maruan! The processing speeds were, um, much, much faster.

Best,
Tim

On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:

> Yes, please
>
> Thanks in advance!
>
> On 28.07.20 at 12:45, Tim Allison wrote:
> > Y. I can run these today
> >
> > On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <andr...@lehmi.de>
> > wrote:
> >
> >> Hi,
> >>
> >> is there any chance to run the PDFBox regression tests (2.0.20 vs.
> >> SNAPSHOT) on our new box? Does anyone have the cycles to prepare
> >> something ready to start?
> >>
> >> If not, is there anything I can do to help? I'm planning to cut a new
> >> PDFBox release soon.
> >>
> >> Cheers
> >> Andreas