On 13.01.2025 14:23, Tilman Hausherr wrote:
On 12.01.2025 16:52, Tilman Hausherr wrote:
I will redo the "A" part and later the "B" part due to the font installation (thanks).

https://home.snafu.de/tilman/tmp/reports_pdfbox_3.0.3_vs_3.0.4-3.tar.xz

There are some new exceptions, but I assume that these aren't real, rather some Tika or OS problems.

I didn't find any problems that need to be handled. The things I found have been mentioned before: the superscript problem and the "spaced" problem.

The superscript problem may be solved in the future either by an algorithm change (I don't know whether that is possible) so that numbers in front of a Latin word get separated, or by improved strategies for estimating the space size, maybe with a database of fonts and their space sizes.
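
To illustrate the first idea, here is a minimal sketch that works as a post-processing step on already extracted text. It is not a change inside PDFBox and does not use any PDFBox API. The class name SuperscriptSplitter, the method splitLeadingDigits and the three-letter threshold are all assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: separates a run of digits that is
// glued to the front of a Latin word, e.g. a footnote marker such as "1Introduction".
public class SuperscriptSplitter {

    // Digits that are not preceded by a letter or digit and are directly
    // followed by at least three Latin letters. The threshold of three is an
    // assumption so that "2nd" or "5mm" stay untouched.
    private static final Pattern LEADING_DIGITS =
            Pattern.compile("(?<![\\p{L}\\p{N}])(\\d+)(?=\\p{IsLatin}{3,})");

    public static String splitLeadingDigits(String text) {
        // Re-emit the digits followed by a single space.
        return LEADING_DIGITS.matcher(text).replaceAll("$1 ");
    }

    public static void main(String[] args) {
        System.out.println(splitLeadingDigits("see 1Introduction and H2O"));
    }
}
```

This is only a heuristic. It cannot tell a real superscript from a token that legitimately starts with digits, so it would need position or font size information from the stripper to be reliable.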

It may also be that users want the /ActualText feature to be configurable. Most of the time it improves things, but sometimes it is used to censor extraction.

Tilman


