On 08.01.25 at 04:56, Tilman Hausherr wrote:
On 07.01.2025 15:00, Tilman Hausherr wrote:
- mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 |
creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 | ons:
2 | 0or: 1", "creatinga" and "anand" DO NOT APPEAR in ordinary text
extractions, not even with Tika from the command line. But they appear
in the tika extraction JSON file on the machine. I'll try to
investigate this.
It turns out that it happens with PDFBox ExtractText only on the
regression test machine, and with rendering too.
The cause is that the machine has no fonts, so our Liberation Sans is
used. That font is slightly larger. So instead of rendering like this
[image]
we get this
[image]
And text extraction uses these positions too.
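To illustrate why a slightly larger fallback font can merge two words, here is a toy sketch in Python. It is not PDFBox's actual algorithm: the function `extract_text`, the fixed per-glyph width, the word positions, and the `gap_factor` threshold are all assumptions made up for the illustration. The idea is only that each word starts at a position fixed by the PDF, while the width of the word comes from whichever font is actually used.

```python
def extract_text(words, glyph_width, gap_factor=0.5):
    """Join positioned words, inserting a space only when the horizontal
    gap to the previous word exceeds gap_factor * glyph_width.

    words: list of (text, x_start) pairs; x_start is fixed by the PDF.
    glyph_width: hypothetical uniform glyph width of the font in use.
    """
    threshold = gap_factor * glyph_width
    pieces = []
    prev_end = None
    for text, x_start in words:
        if prev_end is not None and x_start - prev_end > threshold:
            pieces.append(" ")
        pieces.append(text)
        # The end of the word depends on the widths of the font in use.
        prev_end = x_start + len(text) * glyph_width
    return "".join(pieces)


# Hypothetical positions: "a" starts 45 units to the right of "creating".
words = [("creating", 0.0), ("a", 45.0)]

print(extract_text(words, glyph_width=5.0))  # intended font: gap is wide enough
print(extract_text(words, glyph_width=5.5))  # wider fallback font: gap collapses
```

With the wider glyphs the first word ends closer to the start of the second one, the gap drops below the threshold, and no space is emitted.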
"creatinga" appears in the "B" column of the Excel file because the
"B" run has 3 more occurrences of it than the "A" run.
So we should install fonts on the test machine, see
https://pdfbox.apache.org/3.0/faq.html#what-fonts-do-i-need-on-my-system%3F
Tilman
I've tried to do so, but my user isn't a member of the sudo group :-o
@Maruan or @Tim
Please install those missing fonts or add me to the sudo group.
Thanks
Andreas