[jira] [Commented] (PDFBOX-5155) Erro extracting text from pdf - Can't read the embedded Type1 font FDFBJU+NewsGothic

Michael Klink (Jira) Mon, 12 Apr 2021 08:30:37 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17319510#comment-17319510
 ]


Michael Klink commented on PDFBOX-5155:
---------------------------------------

{quote}I can extract the text with Adobe Reader without any issues. With pdfbox 
the output text contains gibberish.
{quote}
PDFBox text extraction relies solely on mapping the arguments of text-drawing 
instructions to Unicode. Adobe Reader additionally supports text extraction via 
*ActualText* entries of structure elements or marked-content sequences. Thus, 
there are situations in which Adobe Reader text extraction returns something 
different from PDFBox text extraction. Whether Adobe Reader or PDFBox returns 
the expected result depends on the PDF in question; a given PDF might trip up 
either tool.

Thus, please check whether Adobe Reader actually returns a different result 
because of such *ActualText* entries or really because of better font parsing.
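
To make the distinction concrete, here is a toy model in plain Java. It is a sketch only: it does not use the PDFBox or Adobe Reader APIs, and every name in it (ActualTextSketch, MarkedContent, extractFromCodes, extractPreferringActualText) is hypothetical. It assumes a font whose code-to-Unicode mapping is wrong and a marked-content sequence that carries an *ActualText* entry, and shows how the two extraction strategies can then disagree.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names belong to PDFBox or Adobe Reader.
class ActualTextSketch {

    // A marked-content sequence: the character codes passed to the
    // text-drawing instructions, plus an optional ActualText replacement.
    static final class MarkedContent {
        final int[] codes;
        final String actualText; // null when no ActualText entry exists

        MarkedContent(int[] codes, String actualText) {
            this.codes = codes;
            this.actualText = actualText;
        }
    }

    // Strategy 1: map every character code to Unicode through the font's
    // mapping and ignore ActualText entirely.
    static String extractFromCodes(MarkedContent mc, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : mc.codes) {
            // Unmapped codes become U+FFFD (replacement character).
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    // Strategy 2: use ActualText when present, otherwise fall back to
    // mapping the character codes.
    static String extractPreferringActualText(MarkedContent mc, Map<Integer, String> toUnicode) {
        return mc.actualText != null ? mc.actualText : extractFromCodes(mc, toUnicode);
    }

    public static void main(String[] args) {
        // Assumed broken mapping: the codes map to the wrong characters.
        Map<Integer, String> brokenMap = new HashMap<>();
        brokenMap.put(1, "#");
        brokenMap.put(2, "@");

        MarkedContent mc = new MarkedContent(new int[] {1, 2}, "Hi");

        System.out.println(extractFromCodes(mc, brokenMap));            // gibberish from the mapping
        System.out.println(extractPreferringActualText(mc, brokenMap)); // text taken from ActualText
    }
}
```

In this model the second strategy gives readable text even though the font mapping is unusable, which is why a correct result in Adobe Reader does not by itself prove better font parsing.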

> Erro extracting text from pdf - Can't read the embedded Type1 font 
> FDFBJU+NewsGothic
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5155
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5155
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.22, 2.0.23
>         Environment: Java 11
>            Reporter: nithin nambiar
>            Priority: Major
>         Attachments: image-2021-04-07-17-11-10-048.png
>
>
> When I try to extract text from the command line using pdfbox versions 2.0.22 and 
> 2.0.23, I get the following error. The PDF is customer-specific, so I can't 
> share it here. Is this error because this particular font is not supported by 
> pdfbox?
> {code:java}
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNING: Invalid ToUnicode CMap in font FDFBJU+NewsGothic
> Apr 07, 2021 1:55:06 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
> SEVERE: Can't read the embedded Type1 font FDFBJU+NewsGothic
> java.io.IOException: Expected INTEGER or REAL but got NAME
>     at org.apache.fontbox.type1.Type1Parser.arrayToNumbers(Type1Parser.java:256)
>     at org.apache.fontbox.type1.Type1Parser.readSimpleValue(Type1Parser.java:168)
>     at org.apache.fontbox.type1.Type1Parser.parseASCII(Type1Parser.java:139)
>     at org.apache.fontbox.type1.Type1Parser.parse(Type1Parser.java:61)
>     at org.apache.fontbox.type1.Type1Font.createWithSegments(Type1Font.java:85)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:263)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:76)
>     at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
>     at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:933)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:515)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:489)
>     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:156)
>     at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:144)
>     at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
>     at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:322)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:269)
>     at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:377)
>     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:274)
>     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

