[
https://issues.apache.org/jira/browse/PDFBOX-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349003#comment-17349003
]
Andreas Lehmkühler commented on PDFBOX-5155:
--------------------------------------------
There is one thing we could try. According to the spec, a font should provide
one of the two mappings (an encoding or a ToUnicode map), but the spec doesn't
define any priority between them. That said, we could simply change the order
of processing within
org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(int, GlyphList)
and see what happens. The tests still pass, and I've checked some other PDFs as
well.
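
To make the idea concrete, here is a minimal, self-contained sketch of what "changing the order of processing" could look like. It is not the PDFBox source: the class MappingOrderSketch, the Priority enum, and the plain java.util.Map stand-ins for the ToUnicode CMap, the encoding, and the glyph list are all hypothetical names chosen for illustration. The sketch only shows that when both mappings exist and disagree, the lookup order decides the extracted text.

```java
import java.util.Map;

// Hypothetical illustration only; none of these names exist in PDFBox.
final class MappingOrderSketch {

    /** Which of the two mappings is consulted first. */
    enum Priority { TO_UNICODE_FIRST, ENCODING_FIRST }

    /** Direct lookup: character code -> Unicode string (stand-in for a ToUnicode CMap). */
    static String viaToUnicode(int code, Map<Integer, String> toUnicodeCMap) {
        return toUnicodeCMap.get(code);
    }

    /** Two-step lookup: character code -> glyph name -> Unicode string. */
    static String viaEncoding(int code,
                              Map<Integer, String> encoding,
                              Map<String, String> glyphList) {
        String glyphName = encoding.get(code);
        return glyphName == null ? null : glyphList.get(glyphName);
    }

    /** Tries the preferred mapping first and falls back to the other one. */
    static String toUnicode(int code,
                            Map<Integer, String> toUnicodeCMap,
                            Map<Integer, String> encoding,
                            Map<String, String> glyphList,
                            Priority priority) {
        if (priority == Priority.TO_UNICODE_FIRST) {
            String unicode = viaToUnicode(code, toUnicodeCMap);
            return unicode != null ? unicode : viaEncoding(code, encoding, glyphList);
        }
        String unicode = viaEncoding(code, encoding, glyphList);
        return unicode != null ? unicode : viaToUnicode(code, toUnicodeCMap);
    }
}
```

With a broken ToUnicode entry that maps code 65 to "X" while the encoding maps it to the glyph name "A", TO_UNICODE_FIRST yields "X" and ENCODING_FIRST yields "A". Codes covered by only one of the two mappings give the same result in either order, which is why such a change would mainly affect fonts where the two mappings conflict.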
> Error extracting text from PDF - Can't read the embedded Type1 font
> FDFBJU+NewsGothic
> -------------------------------------------------------------------------------------
>
> Key: PDFBOX-5155
> URL: https://issues.apache.org/jira/browse/PDFBOX-5155
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.22, 2.0.23
> Environment: Java 11
> Reporter: nithin nambiar
> Assignee: Tilman Hausherr
> Priority: Major
> Fix For: 2.0.24, 3.0.0 PDFBox
>
> Attachments: FDFBJU+NewsGothic-0034.pfa,
> FDFBJU+NewsGothic-Bold-0050.pfa, FDFBJU+NewsGothic-Bold-0050.pfa, Screenshot
> 2021-04-30 at 12.34.20.png, Screenshot 2021-05-04 at 19.37.12.png, Screenshot
> 2021-05-04 at 19.37.43.png, Screenshot 2021-05-04 at 19.38.05.png, Screenshot
> 2021-05-04 at 20.49.24.png, Screenshot 2021-05-06 at 14.57.06.png,
> image-2021-04-07-17-11-10-048.png, image-2021-04-30-13-22-09-187.png,
> image-2021-05-01-09-49-26-222.png, image-2021-05-01-12-54-26-202.png,
> image-2021-05-01-18-07-38-406.png, image-2021-05-04-09-45-53-271.png,
> image-2021-05-04-09-47-17-536.png, image-2021-05-04-09-47-46-988.png,
> image-2021-05-04-17-39-26-079.png, image-2021-05-04-17-41-37-186.png
>
>
> When I try to extract text from the command line using PDFBox versions 2.0.22
> and 2.0.23, I get the following error. The PDF is customer-specific, so I
> can't share it here. Is this error because this particular font is not
> supported by PDFBox?
> {code:java}
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNING: Invalid ToUnicode CMap in font FDFBJU+NewsGothic
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
> SEVERE: Can't read the embedded Type1 font FDFBJU+NewsGothic
> java.io.IOException: Expected INTEGER or REAL but got NAME
>     at org.apache.fontbox.type1.Type1Parser.arrayToNumbers(Type1Parser.java:256)
>     at org.apache.fontbox.type1.Type1Parser.readSimpleValue(Type1Parser.java:168)
>     at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:139)
>     at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
>     at org.apache.fontbox.type1.Type1Font.createWithSegments(Type1Font.java:85)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:263)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:322)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
>     at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:377)
>     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:274)
>     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {code}