[ 
https://issues.apache.org/jira/browse/PDFBOX-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316501#comment-17316501
 ] 

Tilman Hausherr commented on PDFBOX-5155:
-----------------------------------------

Sadly, this screenshot doesn't help. What would help is the file itself, or at 
least the font, which you can extract with PDFDebugger. After opening the file, 
expand the tree, look for the resources, then the fonts, then that font, and go 
further down until you see an entry named "FontFile" (it can also be 
"FontFile2" or "FontFile3"). Then right-click it and save it decompressed. But 
even then, it is unlikely that this will turn out to be a PDFBox bug.
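
If clicking through PDFDebugger is inconvenient, the same extraction can 
probably be scripted. The following is only a minimal sketch against the 
PDFBox 2.0.x API, not an official tool: the class name EmbeddedFontDump, the 
method dumpFonts and the output file naming are made up for illustration, and 
it only looks at fonts in page-level resources (fonts used solely inside form 
XObjects or patterns are not visited). Loading the font through PDResources 
may log the same SEVERE message as in the report; the raw font stream should 
still be written.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDFontDescriptor;

// Hypothetical helper, not part of PDFBox.
public class EmbeddedFontDump {

    /** Writes the decoded font program of each page-level font to outDir. */
    public static List<File> dumpFonts(File pdf, File outDir) throws IOException {
        List<File> written = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        try (PDDocument doc = PDDocument.load(pdf)) {
            for (PDPage page : doc.getPages()) {
                PDResources res = page.getResources();
                if (res == null) {
                    continue; // page without a resource dictionary
                }
                for (COSName key : res.getFontNames()) {
                    PDFont font = res.getFont(key);
                    PDFontDescriptor fd = (font == null) ? null : font.getFontDescriptor();
                    if (fd == null) {
                        continue; // e.g. a standard 14 font without a descriptor
                    }
                    // Type 1 programs are normally under FontFile; the other
                    // two entries hold TrueType and CFF/OpenType data.
                    PDStream stream = fd.getFontFile();
                    String suffix = ".fontfile";
                    if (stream == null) {
                        stream = fd.getFontFile2();
                        suffix = ".fontfile2";
                    }
                    if (stream == null) {
                        stream = fd.getFontFile3();
                        suffix = ".fontfile3";
                    }
                    if (stream == null) {
                        continue; // font is not embedded
                    }
                    String base = (font.getName() != null) ? font.getName() : key.getName();
                    String fileName = base.replaceAll("[^A-Za-z0-9+._-]", "_") + suffix;
                    if (!seen.add(fileName)) {
                        continue; // same font already saved from an earlier page
                    }
                    File target = new File(outDir, fileName);
                    // createInputStream() applies the stream filters, i.e. decompresses.
                    try (InputStream in = stream.createInputStream()) {
                        Files.copy(in, target.toPath(), StandardCopyOption.REPLACE_EXISTING);
                    }
                    written.add(target);
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java EmbeddedFontDump input.pdf output-directory
        for (File f : dumpFonts(new File(args[0]), new File(args[1]))) {
            System.out.println("Saved " + f);
        }
    }
}
```

For the font in this report, the interesting output would be the file whose 
name starts with FDFBJU+NewsGothic; that is the piece worth attaching to the 
issue if the PDF itself cannot be shared.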

> Error extracting text from pdf - Can't read the embedded Type1 font 
> FDFBJU+NewsGothic
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5155
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5155
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.22, 2.0.23
>         Environment: Java 11
>            Reporter: nithin nambiar
>            Priority: Major
>         Attachments: image-2021-04-07-17-11-10-048.png
>
>
> When I try to extract text from the command line using PDFBox version 2.0.22 
> or 2.0.23, I get the following error. The PDF is customer-specific, so I 
> can't share it here. Is this error because this particular font is not 
> supported by PDFBox?
> {code:java}
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNING: Invalid ToUnicode CMap in font FDFBJU+NewsGothic
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
> SEVERE: Can't read the embedded Type1 font FDFBJU+NewsGothic
> java.io.IOException: Expected INTEGER or REAL but got NAME
>     at org.apache.fontbox.type1.Type1Parser.arrayToNumbers(Type1Parser.java:256)
>     at org.apache.fontbox.type1.Type1Parser.readSimpleValue(Type1Parser.java:168)
>     at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:139)
>     at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
>     at org.apache.fontbox.type1.Type1Font.createWithSegments(Type1Font.java:85)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:263)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:322)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
>     at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:377)
>     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:274)
>     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {code}



