On 08.01.25 at 04:56, Tilman Hausherr wrote:
On 07.01.2025 15:00, Tilman Hausherr wrote:
- mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 | ons: 2 | 0or: 1", but "creatinga" and "anand" DO NOT APPEAR in ordinary text extractions, not even with Tika from the command line. They do appear in the Tika extraction JSON file on the machine. I'll try to investigate this.

It turns out that this happens with PDFBox ExtractText only on the regression test machine. It happens with rendering too.

The cause is that the machine has no fonts, so our Liberation Sans is used. That font is slightly larger. So instead of rendering like this

We get this

And text extraction uses these positions too.
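
To illustrate why a slightly wider font can glue two words together, here is a minimal sketch. It is NOT the actual PDFBox text extraction code; the class GapDemo, the 0.3 threshold factor and all the numbers are made up for illustration. It only assumes that a space is emitted when the gap between the computed end of one run and the start of the next run is large enough, and that the end of a run is computed from the glyph advances of whatever font is actually used.

```java
import java.util.List;

public class GapDemo {
    /** A positioned text run: its text, start x and font size (hypothetical units). */
    public static final class Run {
        final String text;
        final double x;
        final double fontSize;

        public Run(String text, double x, double fontSize) {
            this.text = text;
            this.x = x;
            this.fontSize = fontSize;
        }
    }

    /**
     * Joins runs from left to right. A space is inserted only when the gap
     * between the computed end of the previous run and the start of the next
     * one exceeds a threshold. avgAdvance is an assumed average glyph advance
     * per unit of font size; a wider substitute font has a larger value.
     */
    public static String join(List<Run> runs, double avgAdvance) {
        StringBuilder sb = new StringBuilder();
        double prevEnd = Double.NaN;
        for (Run r : runs) {
            double spaceThreshold = 0.3 * avgAdvance * r.fontSize;
            if (!Double.isNaN(prevEnd) && r.x - prevEnd > spaceThreshold) {
                sb.append(' ');
            }
            sb.append(r.text);
            // The end position depends on the font metrics, not on the PDF alone.
            prevEnd = r.x + r.text.length() * avgAdvance * r.fontSize;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same positions from the same file, different font metrics.
        List<Run> runs = List.of(new Run("creating", 100, 10), new Run("a", 146, 10));
        System.out.println(join(runs, 0.50)); // narrower font: gap is wide enough for a space
        System.out.println(join(runs, 0.56)); // wider substitute font: gap falls below the threshold
    }
}
```

With the narrower metrics the two runs stay separate words; with the wider ones the computed end of "creating" almost reaches the start of "a", so no space is emitted.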

The appearance of "creatinga" in the "B" column of the Excel file is because there are 3 more occurrences than in the "A" run.

So we should install fonts on the test machine, see
I've tried to do so, but my user isn't a member of the sudo group :-o

@Maruan or @Tim
Please install those missing fonts or add me to the sudo group

Thanks
Andreas

https://pdfbox.apache.org/3.0/faq.html#what-fonts-do-i-need-on-my-system%3F

Tilman


