On 13.01.2025 14:23, Tilman Hausherr wrote:
On 12.01.2025 16:52, Tilman Hausherr wrote:
I will redo the "A" part and later the "B" part due to the font
installation (thanks).
https://home.snafu.de/tilman/tmp/reports_pdfbox_3.0.3_vs_3.0.4-3.tar.xz
There are some new exceptions, but I assume that these aren't real
but rather some Tika or OS problems.
I didn't find any problems that need to be handled. The things I found
have been mentioned before: the superscript problem and the
"spaced" problem.
The superscript problem may be solved in the future either by an
algorithm change (I don't know whether that is possible) so that numbers
in front of a Latin word get separated, or by improved strategies for
estimating the space size, maybe with a database of fonts and their
space sizes.
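To illustrate the first idea, here is a minimal sketch of such a
separation done as a post-processing step on the extracted text. This is
not PDFBox code and not a proposal for the actual stripper algorithm;
the class name LeadingNumberSeparator and the method separate() are made
up for this example, and the rule "digits followed by at least two Latin
letters" is only an assumption about what a superscript marker glued to
a word looks like.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: splits a number that is glued
// to the front of a Latin word, e.g. "1Department" -> "1 Department".
final class LeadingNumberSeparator {

    // Digits that do not follow a letter or digit, directly followed by
    // at least two Latin-script letters. The two-letter minimum is an
    // assumption to leave tokens like "3D" alone.
    private static final Pattern LEADING_NUMBER =
            Pattern.compile("(?<![\\p{L}\\p{N}])(\\d+)(\\p{IsLatin}{2,})");

    static String separate(String text) {
        // Insert a single space between the number and the word.
        return LEADING_NUMBER.matcher(text).replaceAll("$1 $2");
    }
}
```

Calling LeadingNumberSeparator.separate() on the output of the text
stripper would separate affiliation markers such as "1Department", while
"H2O" and "3D" stay untouched. It would also split ordinals like "2nd",
so a purely textual rule like this is probably too crude on its own.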
It is also possible that users will want the /ActualText feature to be
configurable. Most of the time it improves things, but sometimes it is
used to censor the extracted text.
Tilman