Re: PDFBox 2.0.33 release

Andreas Lehmkühler Sun, 12 Jan 2025 04:25:33 -0800



On 08.01.25 at 04:56, Tilman Hausherr wrote:

On 07.01.2025 15:00, Tilman Hausherr wrote:
- mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 | ons: 2 | 0or: 1". "creatinga" and "anand" DO NOT APPEAR in ordinary text extractions, not even with Tika from the command line. But they appear in the Tika extraction JSON file on the machine. I'll try to investigate this.
It turns out that it happens with PDFBox ExtractText only on the regression test machine. And with rendering too.
The cause is that the machine has no fonts, so our Liberation Sans is used. That font is slightly larger. So instead of rendering like this
We get this

And text extraction uses these positions too.
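
A rough sketch of why wider glyphs can merge words during extraction. This is a simplified gap heuristic, not PDFBox's actual algorithm: the function name, the fixed per-glyph width, the 0.5 threshold factor and the coordinates are all made up for illustration, and "creating a" as the original text is only a guess.

```python
def extract(words, glyph_width):
    """Join positioned words, inserting a space only when the gap is wide enough.

    words: list of (text, x_start) pairs; x_start is fixed by the PDF content.
    glyph_width: width of one glyph, taken from whichever font is actually used.
    """
    threshold = 0.5 * glyph_width  # hypothetical rule: gap must exceed half a glyph
    out = []
    prev_end = None
    for text, x_start in words:
        if prev_end is not None and x_start - prev_end > threshold:
            out.append(" ")
        out.append(text)
        # The end of the word depends on the font's glyph widths.
        prev_end = x_start + len(text) * glyph_width
    return "".join(out)


# Hypothetical positions: "a" is placed at x=50 regardless of the font.
words = [("creating", 0.0), ("a", 50.0)]
narrow = extract(words, 5.0)  # intended font: gap of 10 units, space is kept
wide = extract(words, 6.0)    # wider fallback font: gap shrinks to 2 units
print(narrow)
print(wide)
```

With the wider glyphs the first word ends closer to the next one, the gap falls below the threshold, and the two words come out as one token.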
The appearance of "creatinga" in the "B" column of the Excel file is because there are 3 more than in the "A" run.
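
A minimal sketch of how such a token shows up in a comparison between two runs. This assumes the comparison lists tokens that occur more often in run B than in run A; the helper names and sample texts are invented for illustration and are not taken from the regression tooling.

```python
import re
from collections import Counter


def token_counts(text):
    """Count lowercase alphabetic tokens in the extracted text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def more_in_b(text_a, text_b):
    """Return tokens that occur more often in run B than in run A."""
    # Counter subtraction keeps only the positive differences.
    return dict(token_counts(text_b) - token_counts(text_a))


# Made-up extraction results: in run B the space was lost three times.
run_a = "creating a message " * 3
run_b = "creatinga message " * 3
result = more_in_b(run_a, run_b)
print(result)
```

Tokens that are identical in both runs, such as "message" here, cancel out, so only the merged token remains in the B-side difference.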
So we should install fonts on the test machine, see

I've tried to do so, but my user isn't a member of the sudo group :-o

@Maruan or @Tim
Please install those missing fonts or add me to the sudo group

Thanks
Andreas


https://pdfbox.apache.org/3.0/faq.html#what-fonts-do-i-need-on-my-system%3F

Tilman



