Re: PDFBox 2.0.33 release

Tilman Hausherr Tue, 07 Jan 2025 19:57:01 -0800

On 07.01.2025 15:00, Tilman Hausherr wrote:

- mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 | ons: 2 | 0or: 1", "creatinga" and "anand" DO NOT APPEAR in ordinary text extractions, not even with Tika from the command line. But they appear in the Tika extraction JSON file on the machine. I'll try to investigate this.

It turns out that it happens with PDFBox ExtractText only on the regression test machine. And with rendering too.

The cause is that the machine has no fonts, so our Liberation Sans is used. That font is slightly larger. So instead of rendering like this


We get this

And text extraction uses these positions too.

The appearance of "creatinga" in the "B" column of the Excel file is because there are 3 more occurrences than in the "A" run.
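To illustrate how such a token ends up in the "B" column, here is a minimal sketch that counts tokens in two extraction results and reports the ones that are more frequent in run B. This is not the code used by the regression report; the class name TokenCountDiff, the tokenization rule and the sample strings are made up for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TokenCountDiff {

    // Count lower-cased alphanumeric tokens in one extraction result.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Tokens that occur more often in run B than in run A, with the surplus count.
    static Map<String, Integer> surplusInB(String textA, String textB) {
        Map<String, Integer> a = countTokens(textA);
        Map<String, Integer> surplus = new TreeMap<>();
        countTokens(textB).forEach((token, n) -> {
            int diff = n - a.getOrDefault(token, 0);
            if (diff > 0) {
                surplus.put(token, diff);
            }
        });
        return surplus;
    }

    public static void main(String[] args) {
        // Invented sample text: run B has lost the space after "creating".
        String runA = "creating a file, creating a list, creating a report";
        String runB = "creatinga file, creatinga list, creatinga report";
        System.out.println(surplusInB(runA, runB)); // prints {creatinga=3}
    }
}
```

With these sample strings the merged token shows up three times in run B and never in run A, which matches the "creatinga: 3" entry.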


So we should install fonts on the test machine, see

https://pdfbox.apache.org/3.0/faq.html#what-fonts-do-i-need-on-my-system%3F

Tilman
