[jira] [Commented] (PDFBOX-5155) Error extracting text from PDF - Can't read the embedded Type1 font FDFBJU+NewsGothic

Tilman Hausherr (Jira) Wed, 19 May 2021 10:18:04 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347782#comment-17347782
 ]


Tilman Hausherr commented on PDFBOX-5155:
-----------------------------------------

We would have to change the existing workaround for the "Invalid 
ToUnicode CMap in font" message (in PDFont.loadUnicodeCmap) so that it no 
longer returns an identity CMap. That might break text extraction for other 
files.
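
To illustrate why an identity fallback helps some files and hurts others, here 
is a minimal toy sketch. It is not PDFBox code: the class name, the decode 
method and the Map standing in for a ToUnicode CMap are all hypothetical, and a 
null map is assumed to model a CMap that could not be parsed.

```java
import java.util.HashMap;
import java.util.Map;

final class IdentityFallbackSketch {
    // Hypothetical stand-in for a ToUnicode CMap: character code -> Unicode text.
    // Passing null models a CMap that was invalid and could not be used.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            if (toUnicode != null) {
                String mapped = toUnicode.get(code);
                sb.append(mapped != null ? mapped : "?");
            } else {
                // Identity fallback: treat the character code as the code point.
                sb.appendCodePoint(code);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] asciiLikeCodes = {72, 105}; // codes that coincide with 'H' and 'i'
        int[] customCodes = {1, 2};       // codes from a font-specific encoding

        Map<Integer, String> validCMap = new HashMap<>();
        validCMap.put(1, "H");
        validCMap.put(2, "i");

        // With a usable mapping, the custom codes decode correctly.
        System.out.println(decode(customCodes, validCMap));
        // Without one, identity works only when codes happen to match Unicode.
        System.out.println(decode(asciiLikeCodes, null));
        System.out.println(decode(customCodes, null).equals("Hi"));
    }
}
```

In this toy model the identity fallback recovers "Hi" from the ASCII-like codes 
but yields control characters for the custom codes, which is the trade-off 
described above.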

If this file is one of many different kinds of files that you process (for 
example, invoices from many sources): use OCR. There are many flawed PDF files.
If you only ever receive files like this one: get the creator of the file to 
fix it, perhaps by updating the library that they are using. "But it works 
with Adobe" doesn't mean the file is correct.
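
For the mixed-input case, one possible way to decide per file whether to trust 
the extracted text or hand the page to OCR is a simple plausibility check. This 
is only a sketch under stated assumptions: the class, the method names and the 
0.2 threshold are hypothetical, and the input string stands in for whatever the 
text extraction step returned.

```java
final class ExtractionQualitySketch {
    // Hypothetical threshold: above this share of suspicious characters, prefer OCR.
    static final double MAX_SUSPICIOUS_RATIO = 0.2;

    // Share of code points that are unlikely to appear in legitimately extracted text.
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 1.0; // nothing extracted is treated as a failed extraction
        }
        int suspicious = 0;
        int total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            boolean ordinaryWhitespace =
                    cp == ' ' || cp == '\n' || cp == '\r' || cp == '\t';
            if (ordinaryWhitespace) {
                continue;
            }
            if (Character.isISOControl(cp)
                    || cp == 0xFFFD                       // replacement character
                    || !Character.isDefined(cp)
                    || Character.getType(cp) == Character.PRIVATE_USE) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    static boolean shouldFallBackToOcr(String extractedText) {
        return suspiciousRatio(extractedText) > MAX_SUSPICIOUS_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(shouldFallBackToOcr("Invoice 2021-05"));
        System.out.println(shouldFallBackToOcr("\u0001\u0002\u0003 a"));
    }
}
```

A check like this cannot catch text that is wrong but looks plausible, so it 
complements rather than replaces getting the source files fixed.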

> Error extracting text from PDF - Can't read the embedded Type1 font 
> FDFBJU+NewsGothic
> -------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5155
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5155
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.22, 2.0.23
>         Environment: Java 11
>            Reporter: nithin nambiar
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.24, 3.0.0 PDFBox
>
>         Attachments: FDFBJU+NewsGothic-0034.pfa, 
> FDFBJU+NewsGothic-Bold-0050.pfa, FDFBJU+NewsGothic-Bold-0050.pfa, Screenshot 
> 2021-04-30 at 12.34.20.png, Screenshot 2021-05-04 at 19.37.12.png, Screenshot 
> 2021-05-04 at 19.37.43.png, Screenshot 2021-05-04 at 19.38.05.png, Screenshot 
> 2021-05-04 at 20.49.24.png, Screenshot 2021-05-06 at 14.57.06.png, 
> image-2021-04-07-17-11-10-048.png, image-2021-04-30-13-22-09-187.png, 
> image-2021-05-01-09-49-26-222.png, image-2021-05-01-12-54-26-202.png, 
> image-2021-05-01-18-07-38-406.png, image-2021-05-04-09-45-53-271.png, 
> image-2021-05-04-09-47-17-536.png, image-2021-05-04-09-47-46-988.png, 
> image-2021-05-04-17-39-26-079.png, image-2021-05-04-17-41-37-186.png
>
>
> When I try to extract text from the command line using PDFBox versions 
> 2.0.22 and 2.0.23, I get the following error. The PDF is customer-specific, 
> so I can't share it here. Is this error because this particular font is not 
> supported by PDFBox?
> {code:java}
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNING: Invalid ToUnicode CMap in font FDFBJU+NewsGothic
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
> SEVERE: Can't read the embedded Type1 font FDFBJU+NewsGothic
> java.io.IOException: Expected INTEGER or REAL but got NAME
>     at org.apache.fontbox.type1.Type1Parser.arrayToNumbers(Type1Parser.java:256)
>     at org.apache.fontbox.type1.Type1Parser.readSimpleValue(Type1Parser.java:168)
>     at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:139)
>     at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
>     at org.apache.fontbox.type1.Type1Font.createWithSegments(Type1Font.java:85)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:263)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:322)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
>     at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:377)
>     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:274)
>     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

