On 11.12.2020 at 20:13, [email protected] wrote:
On Friday, 11.12.2020, at 14:58 +0100, Tilman Hausherr wrote:
The exceptions are mostly about the acroform fixup.
This fails when the font can't be used.

bug_trackers/PDFBOX/PDFBOX-4086-0.pdf
bug_trackers/PDFBOX/PDFBOX-4086-1.pdf
bug_trackers/PDFBOX/PDFBOX-4086-2.pdf
bug_trackers/PDFBOX/PDFBOX-3587-0.zip-5.pdf
bug_trackers/PDFBOX/PDFBOX-3642-0.pdf
They should be fixed now.


Thanks!



However I wonder if Tika should also be changed: it doesn't need the
appearances for text extraction. However it could use the field
repair.
would be beneficial - that's also the reason why there are multiple
processors with a single purpose.


Yeah, I've created a Tika issue as well.

In the meantime I had a look at the content differences. First I sorted by the token decrease and then looked at TOP_10_UNIQUE_TOKEN_DIFFS_A; any "full" words there would be suspicious. It turned out that none of them were relevant. These were improvements, likely thanks to PDFBOX-5002. I looked at the history of that user: he had submitted only one other issue and patch, in 2016, and I wrote that it was a "high quality patch" 😂

Tomorrow or Sunday I'll sort by NUM_COMMON_TOKENS_DIFF_IN_B and do another test.
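
The review steps described above (sort the comparison rows by a token-difference column, then check TOP_10_UNIQUE_TOKEN_DIFFS_A for "full" words) could be sketched roughly as follows. This is only an illustration under assumptions: the FILE column, the "token: count | token: count" cell format, the sample rows, and the "full word" heuristic (alphabetic, at least four characters) are all hypothetical and not taken from the actual reports.

```python
# Hypothetical sample rows; only the two column names mentioned in the
# thread are real, everything else (FILE, values, cell format) is assumed.
rows = [
    {"FILE": "a.pdf", "NUM_COMMON_TOKENS_DIFF_IN_B": -120,
     "TOP_10_UNIQUE_TOKEN_DIFFS_A": "xq: 40 | zzt: 12"},
    {"FILE": "b.pdf", "NUM_COMMON_TOKENS_DIFF_IN_B": 35,
     "TOP_10_UNIQUE_TOKEN_DIFFS_A": ""},
    {"FILE": "c.pdf", "NUM_COMMON_TOKENS_DIFF_IN_B": -15,
     "TOP_10_UNIQUE_TOKEN_DIFFS_A": "invoice: 3 | tl: 2"},
]


def parse_token_diffs(cell):
    """Split an assumed 'token: count | token: count' cell into tokens."""
    tokens = []
    for part in cell.split("|"):
        token, sep, count = part.strip().rpartition(":")
        if sep and count.strip().isdigit():
            tokens.append(token.strip())
    return tokens


def looks_like_full_word(token, min_len=4):
    """Crude heuristic: purely alphabetic and reasonably long."""
    return len(token) >= min_len and token.isalpha()


def suspicious_files(rows, sort_key="NUM_COMMON_TOKENS_DIFF_IN_B"):
    """Return (file, words) pairs, largest decrease first.

    A file is listed when the tokens that only the old version (A)
    extracted include something that looks like a real word.
    """
    result = []
    for row in sorted(rows, key=lambda r: r[sort_key]):
        words = [t for t in parse_token_diffs(row["TOP_10_UNIQUE_TOKEN_DIFFS_A"])
                 if looks_like_full_word(t)]
        if words:
            result.append((row["FILE"], words))
    return result


print(suspicious_files(rows))
```

Fragments such as "xq" are skipped because they are more likely extraction noise that the new version no longer produces; a real word missing from the new output is what would deserve a closer look.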

Tilman



On 11.12.2020 at 13:07, Tilman Hausherr wrote:
I had a quick look:
- 32 new exceptions
- content is a bit better: for NUM_COMMON_TOKENS, the new version
extracts 100.41% of what the old one did.
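
A percentage like the one above can be obtained by summing the common-token counts for both versions and dividing. The sketch below is a guess at that arithmetic, not the actual report query; the per-file columns NUM_COMMON_TOKENS_A and NUM_COMMON_TOKENS_B and the sample numbers are assumptions.

```python
def common_token_ratio(rows):
    """Percentage of common tokens in B (new) relative to A (old).

    Assumes each row carries hypothetical NUM_COMMON_TOKENS_A and
    NUM_COMMON_TOKENS_B counts for one file.
    """
    total_a = sum(r["NUM_COMMON_TOKENS_A"] for r in rows)
    total_b = sum(r["NUM_COMMON_TOKENS_B"] for r in rows)
    if total_a == 0:
        raise ValueError("old version extracted no common tokens")
    return 100.0 * total_b / total_a


# Made-up counts for two files, for illustration only.
sample = [
    {"NUM_COMMON_TOKENS_A": 600, "NUM_COMMON_TOKENS_B": 610},
    {"NUM_COMMON_TOKENS_A": 400, "NUM_COMMON_TOKENS_B": 394},
]
print(f"{common_token_ratio(sample):.2f}%")
```

A value above 100% means the new version extracted more common tokens overall, even if individual files went down.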

Tilman

On 11.12.2020 at 13:04, Tilman Hausherr wrote:
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.21_vs_2.0.22.tar.xz



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


