[jira] [Commented] (PDFBOX-3792) Getting lots of warnings "No Unicode mapping for..." when extract text

Tilman Hausherr (JIRA) Tue, 16 May 2017 03:44:27 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012127#comment-16012127 ]


Tilman Hausherr commented on PDFBOX-3792:
-----------------------------------------

What does work is a change in the source code of PDSimpleFont.java:
{code}
        if (encoding != null)
        {
            name = encoding.getName(code);
            unicode = unicodeGlyphList.toUnicode(name);
            if (unicode != null)
            {
                return unicode;
            }
            // this segment is new
            if (name.matches("C\\d\\d\\d\\d"))
            {
                unicode = new String(new byte[]{ (byte) Integer.parseInt(name.substring(1)) });
                return unicode;
            }
        }
{code}
But this would make sense only if you have many files all created by PDFdo.com.

If the files are created within your own company, the best option would be to 
have PDFdo.com fix their bug, or to switch to a product that doesn't have it.
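
For anyone who wants to try the fallback outside of PDFBox first, here is a minimal standalone sketch of the same idea. The class `GlyphNameFallback` and its method `toUnicode` are hypothetical names used only for illustration, not PDFBox API. It assumes, as the snippet above does, that the four digits after "C" are the decimal character code. Unlike the snippet above, it decodes the byte explicitly as ISO-8859-1 rather than with the platform default charset; that choice is an assumption and may not match what the producer intended.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
public class GlyphNameFallback
{
    // Glyph names of the form "C" followed by exactly four digits, e.g. "C0065".
    private static final Pattern C_NAME = Pattern.compile("C\\d{4}");

    // Returns a one-character string for names like "C0065", or null if the
    // name does not follow that pattern or the code does not fit in one byte.
    static String toUnicode(String name)
    {
        if (name == null || !C_NAME.matcher(name).matches())
        {
            return null;
        }
        // Leading zeros are fine: parseInt reads "0065" as decimal 65.
        int code = Integer.parseInt(name.substring(1));
        if (code > 255)
        {
            return null;
        }
        // Assumption: interpret the single byte as ISO-8859-1 (Latin-1).
        return new String(new byte[] { (byte) code }, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        System.out.println(toUnicode("C0065")); // prints "A"
        System.out.println(toUnicode("space")); // prints "null"
    }
}
```

Names that already resolve through the glyph list never reach this path in the change above, so the sketch only covers the case where the regular lookup returned null.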

> Getting lots of warnings "No Unicode mapping for..." when extract text
> ----------------------------------------------------------------------
>
>                 Key: PDFBOX-3792
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3792
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.5
>            Reporter: sunny xia
>         Attachments: FileWithIssue.pdf, IssueLog.txt, OutputText.txt
>
>
> When I use PDFBox to extract text, I get lots of warnings and the output is 
> only garbage. But when I use Adobe Acrobat to export the attached PDF 
> file to text, it works fine. I have attached the original PDF file, the text 
> output and the log with warnings. In addition, the PDF file seems to have a 
> Type-1 font embedded with a custom encoding. I have checked lots of reports on 
> the JIRA issue tracker, but still found no way to solve it.



