On 07.01.2025 14:10, Tilman Hausherr wrote:
latest:
https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33-6.tar.xz
So this is pretty good now. Here's what I found:
- superscript degradation ("1 coupled" becomes "1coupled"): annoying,
but should be solved separately some day with an algorithm improvement.
Having correct space detection in ordinary texts has a higher priority.
- spaced texts degradation ("METAMORPHOSE" becomes "M E T A M O R P H O
S E"): that's because these texts look like that in the original.
- angled degradation: these are differences, but both extractions are
bad. That's what the angle option is for (maybe use this option in the
future?)
- mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga:
3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 | ons: 2 | 0or: 1",
"creatinga" and "anand" DO NOT APPEAR in ordinary text extractions, not
even with Tika from the command line. But they appear in the Tika
extraction JSON file on the machine. I'll try to investigate this.
- PDFBOX-5384 - we'll probably need more time for that one.
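
For the mysterious tokens in 838013.pdf, one quick way to narrow things
down might be to diff the token sets of the two extractions. A minimal
sketch in Python, assuming both texts are already available as strings;
the function names and the tokenizer regex are made up for illustration
and are not what the regression reports actually use:

```python
import re
from collections import Counter


def token_counts(text):
    """Count lowercase alphabetic tokens (a rough stand-in tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def only_in_first(first_text, second_text):
    """Return tokens (with their counts) that occur in first_text but never in second_text."""
    first = token_counts(first_text)
    second = token_counts(second_text)
    return {tok: n for tok, n in first.items() if tok not in second}


# Made-up sample strings for illustration, NOT taken from 838013.pdf.
json_text = "creatinga message anand ram ram"
plain_text = "creating a message ram ram"

print(only_in_first(json_text, plain_text))
```

Any token this reports would be a candidate for something that differs
between the two extraction paths (different settings, different input,
or a stale JSON file), which is where I'd look first.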
Besides that, lots of improvements, and the tests really helped to find
the flaws in PDFBOX-5920.
Tilman