On Sunday, 12.01.2025 at 13:24 +0100, Andreas Lehmkühler wrote:
> 
> 
> On 08.01.25 at 04:56, Tilman Hausherr wrote:
> > On 07.01.2025 15:00, Tilman Hausherr wrote:
> > > - mysterious: govdocs1/838/838013.pdf has "ion: 4 | name: 4 |
> > > creatinga: 3 | ram: 3 | anand: 2 | jec: 2 | message: 2 | oc: 2 |
> > > ons: 2 | 0or: 1", "creatinga" and "anand" DO NOT APPEAR in ordinary
> > > text extractions, not even with Tika from the command line. But they
> > > appear in the tika extraction JSON file on the machine. I'll try to
> > > investigate this.
> > 
> > It turns out that it happens with PDFBox ExtractText only on the
> > regression test machine. And with rendering too.
> > 
> > The cause is that the machine has no fonts, so our Liberation Sans is
> > used. That font is slightly larger. So instead of rendering like this
> > 
> > We get this
> > 
> > And text extraction uses these positions too.
> > 
> > The appearance of "creatinga" in the "B" column of the Excel file is
> > because there are 3 more than in the "A" run.
> > 
> > So we should install fonts on the test machine, see
> I've tried to do so, but my user isn't a member of the sudo group :-o

I added you to the group - please give it a try and let me know if
there are issues

BR
Maruan

> 
> @Maruan or @Tim
> Please install those missing fonts or add me to the sudo group
> 
> Thanks
> Andreas
> > 
> > https://pdfbox.apache.org/3.0/faq.html#what-fonts-do-i-need-on-my-system%3F
> > 
> > Tilman
> > 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: dev-h...@pdfbox.apache.org
> 

