Re: Xmas 2.0.22 Release?

Tilman Hausherr Fri, 11 Dec 2020 12:09:59 -0800

On 11.12.2020 at 20:13, [email protected] wrote:

On Friday, 11.12.2020, 14:58 +0100, Tilman Hausherr wrote:

The exceptions are mostly about the acroform fixup.
This fails when the font can't be used.


bug_trackers/PDFBOX/PDFBOX-4086-0.pdf
bug_trackers/PDFBOX/PDFBOX-4086-1.pdf
bug_trackers/PDFBOX/PDFBOX-4086-2.pdf
bug_trackers/PDFBOX/PDFBOX-3587-0.zip-5.pdf
bug_trackers/PDFBOX/PDFBOX-3642-0.pdf

they should be fixed now.



Thanks!


However, I wonder if Tika should also be changed: it doesn't need the
appearances for text extraction, though it could use the field
repair.

would be beneficial - that's also the reason why there are multiple
processors with a single purpose.



Yeah, I've created a Tika issue as well.

In the meantime I had a look at the content differences. First I sorted by the tokens decrease, then looked at the TOP_10_UNIQUE_TOKEN_DIFFS_A; any "full" words there would be suspicious. Turns out that there were none relevant. These were improvements, likely thanks to PDFBOX-5002. I looked at the history of that user, he had submitted only one other issue + patch in 2016, and I wrote that it was a "high quality patch" 😂

Tomorrow or Sunday I'll sort by NUM_COMMON_TOKENS_DIFF_IN_B and do another test.
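
A minimal sketch of the review procedure described above, for anyone who wants to repeat it on the report. It is not the actual tooling: the rows, the TOKEN_DECREASE column name, and the "token: count | token: count" cell layout are assumptions made for illustration, and only TOP_10_UNIQUE_TOKEN_DIFFS_A is a name taken from this thread.

```python
# Hypothetical sample rows; the values are invented for illustration.
# TOKEN_DECREASE is an assumed column name, not one from the report.
rows = [
    {"FILE": "a.pdf", "TOKEN_DECREASE": 5,
     "TOP_10_UNIQUE_TOKEN_DIFFS_A": "xq: 3 | zz: 1"},
    {"FILE": "b.pdf", "TOKEN_DECREASE": 40,
     "TOP_10_UNIQUE_TOKEN_DIFFS_A": "report: 12 | fl: 2"},
]


def full_words(cell, min_len=4):
    """Return tokens that look like full words: letters only, min_len or longer.

    Assumes the cell is laid out as "token: count | token: count".
    """
    words = []
    for part in cell.split("|"):
        token = part.rsplit(":", 1)[0].strip()
        if len(token) >= min_len and token.isalpha():
            words.append(token)
    return words


def review_order(rows):
    """Sort rows so the largest token decrease is reviewed first."""
    return sorted(rows, key=lambda r: r["TOKEN_DECREASE"], reverse=True)


if __name__ == "__main__":
    for row in review_order(rows):
        print(row["FILE"], full_words(row["TOP_10_UNIQUE_TOKEN_DIFFS_A"]))
```

A file whose lost tokens include full words would deserve a closer look; fragments such as "xq" are more likely noise that the new version no longer extracts.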


Tilman



On 11.12.2020 at 13:07, Tilman Hausherr wrote:

I had a quick look
- 32 new exceptions
- content is a bit better, for NUM_COMMON_TOKENS the new version
extracts 100.41% of the old one.

Tilman

On 11.12.2020 at 13:04, Tilman Hausherr wrote:

http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.21_vs_2.0.22.tar.xz




-------------------------------------------------------------------
--
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



