[
https://issues.apache.org/jira/browse/PDFBOX-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339189#comment-17339189
]
Tilman Hausherr commented on PDFBOX-5155:
-----------------------------------------
Maybe we're getting closer now. The text on the top right should be
"DictionaryEncoding with differences" but it's just "DictionaryEncoding".
The glyphname is the correct one. Could you please post another screenshot that
shows the encoding Differences array of F15?
I'm wondering if the differences array has a flaw, or whether we ignore the
differences because of the bad /ToUnicode stream.
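
For reference, here is a minimal sketch of how a /Differences array is read according to the PDF specification: an integer sets the current character code, and each glyph name that follows is assigned to that code, which is then incremented. The class and method names are hypothetical and are not PDFBox API. The tokens are assumed to be already parsed into Integer and String values, and the sample array is invented, since the real one from F15 has not been posted yet.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: expands a /Differences array
// into a map from character code to glyph name.
final class DifferencesSketch {

    // Each token is either an Integer (a starting code) or a String
    // (a glyph name, written here without the leading '/').
    static Map<Integer, String> expand(List<Object> tokens) {
        Map<Integer, String> result = new LinkedHashMap<>();
        int code = -1; // no starting code seen yet
        for (Object token : tokens) {
            if (token instanceof Integer) {
                // An integer resets the current code.
                code = (Integer) token;
            } else if (token instanceof String) {
                // A name with no preceding code, or beyond the single-byte
                // range, is treated as a flaw in the array.
                if (code < 0 || code > 255) {
                    throw new IllegalArgumentException(
                            "Glyph name '" + token + "' has no valid code");
                }
                result.put(code, (String) token);
                code++; // consecutive names get consecutive codes
            } else {
                throw new IllegalArgumentException("Unexpected token: " + token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample data: code 39 -> quotesingle, 96 -> grave, 97 -> a.
        List<Object> differences =
                Arrays.<Object>asList(39, "quotesingle", 96, "grave", "a");
        System.out.println(expand(differences));
    }
}
```

A flawed array, for example one that starts with a name instead of a code, makes this sketch throw. How PDFBox itself reacts to such an array is not shown here.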
> Error extracting text from PDF - Can't read the embedded Type1 font
> FDFBJU+NewsGothic
> -------------------------------------------------------------------------------------
>
> Key: PDFBOX-5155
> URL: https://issues.apache.org/jira/browse/PDFBOX-5155
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.22, 2.0.23
> Environment: Java 11
> Reporter: nithin nambiar
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.24, 3.0.0 PDFBox
>
> Attachments: FDFBJU+NewsGothic-0034.pfa,
> FDFBJU+NewsGothic-Bold-0050.pfa, FDFBJU+NewsGothic-Bold-0050.pfa, Screenshot
> 2021-04-30 at 12.34.20.png, image-2021-04-07-17-11-10-048.png,
> image-2021-04-30-13-22-09-187.png, image-2021-05-01-09-49-26-222.png,
> image-2021-05-01-12-54-26-202.png, image-2021-05-01-18-07-38-406.png,
> image-2021-05-04-09-45-53-271.png, image-2021-05-04-09-47-17-536.png,
> image-2021-05-04-09-47-46-988.png, image-2021-05-04-17-39-26-079.png,
> image-2021-05-04-17-41-37-186.png
>
>
> When I try to extract text from the command line using PDFBox versions 2.0.22
> and 2.0.23, I get the following error. The PDF is customer-specific, so I
> can't share it here. Is this error because this particular font is not
> supported by PDFBox?
> {code:java}
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNING: Invalid ToUnicode CMap in font FDFBJU+NewsGothic
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
> SEVERE: Can't read the embedded Type1 font FDFBJU+NewsGothic
> java.io.IOException: Expected INTEGER or REAL but got NAME
>     at org.apache.fontbox.type1.Type1Parser.arrayToNumbers(Type1Parser.java:256)
>     at org.apache.fontbox.type1.Type1Parser.readSimpleValue(Type1Parser.java:168)
>     at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:139)
>     at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
>     at org.apache.fontbox.type1.Type1Font.createWithSegments(Type1Font.java:85)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:263)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:322)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
>     at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:377)
>     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:274)
>     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {code}