On 12.01.25 at 13:58, sahy...@fileaffairs.de wrote:
On Sunday, 12.01.2025 at 13:24 +0100, Andreas Lehmkühler wrote:
On 08.01.25 at 04:56, Tilman Hausherr wrote:
On 07.01.2025 15:00, Tilman Hausherr wrote:
- mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 |
creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 |
ons: 2 | 0or: 1", "creatinga" and "anand" DO NOT APPEAR in ordinary
text extractions, not even with Tika from the command line. But they
appear in the tika extraction JSON file on the machine. I'll try to
investigate this.
It turns out that it happens with PDFBox ExtractText only on the
regression test machine. And with rendering too.
The cause is that the machine has no fonts, so our Liberation Sans is
used. That font is slightly larger. So instead of rendering like this
[image] we get this [image].
And text extraction uses these positions too.
The appearance of "creatinga" in the "B" column of the Excel file is
because there are 3 more occurrences than in the "A" run.
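To illustrate why a slightly wider fallback font can glue two words
together, here is a minimal Python sketch. It is not PDFBox's actual
algorithm: the gap rule, the tolerance of 0.3, the word positions and
the glyph widths (5.0 vs. 5.2) are all made-up assumptions, chosen only
to show the effect. The idea is that the PDF fixes where each word
starts, the font decides how wide each glyph is, and a space is emitted
only when the gap between two glyphs is large enough.

```python
from typing import List, Tuple

# (character, x position, advance width); all values are illustrative
Glyph = Tuple[str, float, float]


def layout(words: List[str], starts: List[float], glyph_width: float) -> List[Glyph]:
    """Place each word at a fixed start x, using one advance width per glyph.

    The start positions stand in for what the PDF content stream dictates;
    glyph_width stands in for the metrics of whatever font is actually used.
    """
    glyphs: List[Glyph] = []
    for word, start in zip(words, starts):
        for i, ch in enumerate(word):
            glyphs.append((ch, start + i * glyph_width, glyph_width))
    return glyphs


def join_glyphs(glyphs: List[Glyph], space_tolerance: float = 0.3) -> str:
    """Rebuild text, inserting a space when the gap to the previous glyph
    exceeds space_tolerance times the average glyph width (hypothetical rule)."""
    if not glyphs:
        return ""
    avg_width = sum(width for _, _, width in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    prev_end = glyphs[0][1] + glyphs[0][2]
    for ch, x, width in glyphs[1:]:
        if x - prev_end > space_tolerance * avg_width:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)


words, starts = ["creating", "a"], [0.0, 42.0]
print(join_glyphs(layout(words, starts, 5.0)))  # intended font metrics
print(join_glyphs(layout(words, starts, 5.2)))  # slightly wider fallback font
```

With the narrower width the gap before "a" stays above the threshold
and the words remain separate; with the wider one the gap shrinks below
it and the two words are merged.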
So we should install fonts on the test machine, see
I've tried to do so, but my user isn't a member of the sudo group :-o
I added you to the group - please give it a try and let me know if
there are issues
Thanks for the fast response. Works like a charm. I've installed the
mentioned ttf-mscorefonts-installer package, so that the missing fonts
are available now.
Hope this helps
Andreas
BR
Maruan
@Maruan or @Tim
Please install those missing fonts or add me to the sudo group
Thanks
Andreas
https://pdfbox.apache.org/3.0/faq.html#what-fonts-do-i-need-on-my-system%3F
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org