On Sunday, 12.01.2025 at 13:24 +0100, Andreas Lehmkühler wrote:
> 
> 
> On 08.01.25 at 04:56, Tilman Hausherr wrote:
> > On 07.01.2025 15:00, Tilman Hausherr wrote:
> > > - mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 |
> > > creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 |
> > > ons: 2 | 0or: 1", "creatinga" and "anand" DO NOT APPEAR in
> > > ordinary text extractions, not even with Tika from the command
> > > line. But they appear in the Tika extraction JSON file on the
> > > machine. I'll try to investigate this.
> > 
> > It turns out that it happens with PDFBox ExtractText only on the
> > regression test machine. And with rendering too.
> > 
> > The cause is that the machine has no fonts, so our Liberation Sans
> > is used. That font is slightly larger. So instead of rendering like
> > this
> > 
> > We get this
> > 
> > And text extraction uses these positions too.
> > 
> > The appearance of "creatinga" in the "B" column of the Excel file
> > is because there are 3 more than in the "A" run.
> > 
> > So we should install fonts on the test machine, see
> I've tried to do so, but my user isn't a member of the sudo group :-o

I added you to the group. Please give it a try and let me know if there
are issues.

BR
Maruan

> 
> @Maruan or @Tim
> Please install those missing fonts or add me to the sudo group
> 
> Thanks
> Andreas
> > 
> > https://pdfbox.apache.org/3.0/faq.html#what-fonts-do-i-need-on-my-system%3F
> > 
> > Tilman
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: dev-h...@pdfbox.apache.org