The new exceptions mostly come from the AcroForm fixup,
which fails when the font can't be used.

bug_trackers/PDFBOX/PDFBOX-4086-0.pdf
bug_trackers/PDFBOX/PDFBOX-4086-1.pdf
bug_trackers/PDFBOX/PDFBOX-4086-2.pdf
bug_trackers/PDFBOX/PDFBOX-3587-0.zip-5.pdf
bug_trackers/PDFBOX/PDFBOX-3642-0.pdf


However, I wonder whether Tika should also be changed: it doesn't need the appearances for text extraction, although it could use the field repair.

Tilman


On 11.12.2020 at 13:07, Tilman Hausherr wrote:
I had a quick look:
- 32 new exceptions
- content is a bit better: for NUM_COMMON_TOKENS, the new version extracts 100.41% of what the old one did.

Tilman

On 11.12.2020 at 13:04, Tilman Hausherr wrote:
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.21_vs_2.0.22.tar.xz





