[ https://issues.apache.org/jira/browse/PDFBOX-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012127#comment-16012127 ]
Tilman Hausherr commented on PDFBOX-3792:
-----------------------------------------
What does work is a change in the source code of PDSimpleFont.java:
{code}
if (encoding != null)
{
    name = encoding.getName(code);
    unicode = unicodeGlyphList.toUnicode(name);
    if (unicode != null)
    {
        return unicode;
    }
    // this segment is new
    if (name.matches("C\\d\\d\\d\\d"))
    {
        unicode = new String(new byte[]{ (byte) Integer.parseInt(name.substring(1)) });
        return unicode;
    }
}
{code}
But this would make sense only if you have many files that were all created by
PDFdo.com. If the files are created in your own company, the best option would
be to have PDFdo.com fix their bug, or to switch to a product that doesn't have
that bug.
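
For anyone who wants to try the idea outside of PDSimpleFont.java first, here
is a minimal standalone sketch of the same fallback. It is not PDFBox code: the
class name GlyphNameFallback and the method toUnicodeFallback are made up for
illustration. It assumes that the four digits in a glyph name are a decimal
character code, and it deliberately treats that code as a Latin-1 code point
instead of going through new String(byte[]), which depends on the platform
default charset. Whether Latin-1 is the right interpretation for codes above
127 in these files is an assumption, not something verified here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class GlyphNameFallback
{
    // "C" followed by exactly four decimal digits, same shape as the patch above.
    private static final Pattern CODE_NAME = Pattern.compile("C(\\d{4})");

    private GlyphNameFallback()
    {
    }

    /**
     * Returns a one-character string for names of the form Cdddd, or null if
     * the name has another form or the code is outside 0..255.
     * Assumption: the digits are a decimal code interpreted as Latin-1.
     */
    static String toUnicodeFallback(String name)
    {
        if (name == null)
        {
            return null;
        }
        Matcher m = CODE_NAME.matcher(name);
        if (!m.matches())
        {
            return null;
        }
        int code = Integer.parseInt(m.group(1));
        if (code > 0xFF)
        {
            // Outside the single-byte range; leave it unmapped.
            return null;
        }
        return String.valueOf((char) code);
    }

    public static void main(String[] args)
    {
        System.out.println(toUnicodeFallback("C0065")); // prints "A"
        System.out.println(toUnicodeFallback("space")); // prints "null"
    }
}
```

Returning null for anything that does not match keeps the existing "no Unicode
mapping" handling in charge of all other glyph names.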
> Getting lots of warnings "No Unicode mapping for..." when extract text
> ----------------------------------------------------------------------
>
> Key: PDFBOX-3792
> URL: https://issues.apache.org/jira/browse/PDFBOX-3792
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.5
> Reporter: sunny xia
> Attachments: FileWithIssue.pdf, IssueLog.txt, OutputText.txt
>
>
> When I use PDFBox to extract text, I get lots of warnings and the output is
> only garbage. But when I use Adobe Acrobat to export the attached PDF
> file to text, it works fine. I have attached the original PDF file, the text
> output and the log with warnings. In addition, the PDF file seems to have a
> Type-1 font embedded with a custom encoding. I have checked lots of reports on
> the JIRA issue tracker, but still have not found a way to solve it.