On 07.01.2025 15:00, Tilman Hausherr wrote:
- mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 | creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 | ons: 2 | 0or: 1", "creatinga" and "anand" DO NOT APPEAR in ordinary text extractions, not even with Tika from the command line. But they appear in the tika extraction JSON file on the machine. I'll try to investigate this.

It turns out that this happens with PDFBox ExtractText only on the regression test machine, and with rendering too.

The cause is that the machine has no fonts, so our Liberation Sans is used. That font is slightly larger. So instead of rendering like this

We get this

And text extraction uses these positions too.
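To illustrate why glyph positions matter for extraction, here is a minimal sketch in Java. It is a simplified model and not PDFBox's actual PDFTextStripper logic: it assumes a word break is emitted only when the horizontal gap between two glyphs exceeds a threshold. The class name, the coordinates, the glyph widths (6.0 for the intended font, 7.0 for the wider fallback) and the threshold of 1.5 are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class GapWordBreakSketch {
    // Hypothetical glyph: a character, its x position and its width.
    static final class Glyph {
        final char c;
        final double x;
        final double width;

        Glyph(char c, double x, double width) {
            this.c = c;
            this.x = x;
            this.width = width;
        }
    }

    // Places "creating" at fixed x positions (step 6.0) and "a" at x = 50.0.
    // Only the glyph width depends on the font that is actually used.
    public static List<Glyph> glyphs(double glyphWidth) {
        List<Glyph> out = new ArrayList<>();
        String word = "creating";
        for (int i = 0; i < word.length(); i++) {
            out.add(new Glyph(word.charAt(i), i * 6.0, glyphWidth));
        }
        out.add(new Glyph('a', 50.0, glyphWidth));
        return out;
    }

    // Inserts a space only when the gap to the previous glyph exceeds the threshold.
    public static String extract(List<Glyph> glyphs, double spaceThreshold) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > spaceThreshold) {
                    sb.append(' ');
                }
            }
            sb.append(g.c);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(glyphs(6.0), 1.5)); // intended font width
        System.out.println(extract(glyphs(7.0), 1.5)); // wider fallback font
    }
}
```

With the intended width the gap before the final "a" is 2.0, so a space is emitted. With the wider fallback the gap shrinks to 1.0 and the two words merge into "creatinga".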

The appearance of "creatinga" in the "B" column of the Excel file is because there are 3 more occurrences than in the "A" run.
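How such a per-token difference could be computed can be sketched as follows. This is not the code of the regression tool, only an assumed model of it: each run yields a token-to-count map, and a token is reported for "B" when it occurs more often there than in "A". The class name and the counts in `main` are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class TokenCountDiffSketch {
    // Returns the tokens that occur more often in run B than in run A,
    // mapped to the number of extra occurrences.
    public static Map<String, Integer> extraInB(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> diff = new TreeMap<>();
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            int delta = e.getValue() - a.getOrDefault(e.getKey(), 0);
            if (delta > 0) {
                diff.put(e.getKey(), delta);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        // Hypothetical counts: run A sees "creating" and "a" as separate tokens,
        // run B sees them merged three times.
        Map<String, Integer> runA = Map.of("creating", 3, "a", 10);
        Map<String, Integer> runB = Map.of("creatinga", 3, "a", 7);
        System.out.println(extraInB(runA, runB));
    }
}
```

A merged token such as "creatinga" never occurs in the "A" run, so its whole count shows up as the difference.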

So we should install fonts on the test machine, see

https://pdfbox.apache.org/3.0/faq.html#what-fonts-do-i-need-on-my-system%3F

Tilman
