On 07.01.2025 14:10, Tilman Hausherr wrote:
latest:

https://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.32_vs_2.0.33-6.tar.xz

So this is pretty good now. Here's what I found:

- superscript degradation ("1 coupled" becomes "1coupled"): annoying, but should be solved separately some day with an algorithm improvement. Having correct space detection in ordinary texts has a higher priority.

- spaced texts degradation ("METAMORPHOSE" becomes "M E T A M O R P H O S E"): that's because these texts look like that in the original.

- angled text degradation: the two outputs differ, but both extractions are bad. That's what the angle option is for (maybe use this option in the future?)

- mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 | ons: 2 | 0or: 1". "creatinga" and "anand" DO NOT APPEAR in ordinary text extractions, not even with Tika from the command line. But they appear in the Tika extraction JSON file on the machine. I'll try to investigate this.

- PDFBOX-5384 - we'll probably need more time for that one.
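
For the govdocs1/838/838013.pdf case, a small check along these lines might help narrow down which reported tokens are really absent from a given extraction. This is only a sketch: the class and method names are hypothetical (not part of PDFBox or Tika), the tokenization (lowercase, split on anything that is not a letter or digit) is an assumption and may not match what the report tooling does, and the extracted text would have to come from whichever extractor is being compared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
class TokenPresenceCheck {

    // Parses a summary such as "ion: 4 | name: 4 | creatinga: 3"
    // into token -> count, keeping the order of the summary.
    static Map<String, Integer> parseSummary(String summary) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String entry : summary.split("\\|")) {
            int colon = entry.lastIndexOf(':');
            if (colon < 0) {
                continue; // not a "token: count" entry
            }
            String token = entry.substring(0, colon).trim();
            String count = entry.substring(colon + 1).trim();
            if (token.isEmpty() || !count.matches("\\d+")) {
                continue; // skip malformed entries instead of failing
            }
            counts.put(token, Integer.parseInt(count));
        }
        return counts;
    }

    // Returns the summary tokens that do not occur as whole words in the text.
    // Assumed tokenization: lowercase, split on non-letter/non-digit characters.
    static List<String> missingTokens(String summary, String extractedText) {
        String[] parts = extractedText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        Set<String> words = new HashSet<>(Arrays.asList(parts));
        List<String> missing = new ArrayList<>();
        for (String token : parseSummary(summary).keySet()) {
            if (!words.contains(token.toLowerCase(Locale.ROOT))) {
                missing.add(token);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String summary = "ion: 4 | name: 4 | creatinga: 3 | ram: 3 | anand: 2";
        // Stand-in text; in practice this would be the real extraction output.
        String text = "Ion name ram";
        System.out.println(missingTokens(summary, text));
    }
}
```

Running the same check against the plain text output and against the text stored in the extraction JSON file would show whether the two really disagree on "creatinga" and "anand".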

Besides that, there are lots of improvements, and the tests really helped find the flaws in PDFBOX-5920.

Tilman


