Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz

Looks like extraction improved slightly.  I found a bug at the Tika level
that is creating a few more exceptions (will fix soon), but this is not a
problem for PDFBox.

I was able to re-enable our unit test that counts characters and
characters without a Unicode mapping.

I'll look a bit tomorrow, but this looks good to me.

Again, many thanks to Maruan!  The processing speeds were, um, much, much
faster.

Best,

       Tim

On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <andr...@lehmi.de>
wrote:

> Yes, please
>
> Thanks in advance!
>
> Am 28.07.20 um 12:45 schrieb Tim Allison:
> > Yes, I can run these today.
> >
> > On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <andr...@lehmi.de>
> > wrote:
> >
> >> Hi,
> >>
> >> is there any chance to run the PDFBox regression tests (2.0.20 vs.
> >> SNAPSHOT) on our new box? Does anyone have the cycles to prepare
> >> something ready to start?
> >>
> >> If not, is there anything I can do to help? I'm planning to cut a new
> >> PDFBox release soon.
> >>
> >> Cheers
> >> Andreas
> >>
> >
>
>
