[ 
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960219#comment-14960219
 ] 

Ben McCann edited comment on PDFBOX-2740 at 10/16/15 6:04 AM:
--------------------------------------------------------------

I'm seeing this bug on another document. The doc is in English, but in PDFBox 
it just loads as gibberish, like ", DP D UHFHQW FROOHJH JUDGXDWH ZLWK"

It's printing output like:

WARN - org.apache.pdfbox.pdmodel.font.PDType0Font - 
No Unicode mapping for CID+74 (74) in font DTZNQG+font0000000015013e02

WARN - org.apache.pdfbox.pdmodel.font.PDType0Font - 
No Unicode mapping for CID+3 (3) in font ZKAQHT+font0000000015013e02

WARN - org.apache.pdfbox.pdmodel.font.PDType0Font - 
No Unicode mapping for CID+3 (3) in font ZKAQHT+font0000000015013e02

WARN - org.apache.pdfbox.pdmodel.font.PDType0Font - 
No Unicode mapping for CID+54 (54) in font OBJREX+font0000000015013e02


The document I'm seeing this in has the following metadata:
Producer: Mac OS X 10.6.8 Quartz PDFContext
Creator: Documill Publishor 6.3.9.1 by Documill (http://www.documill.com/)
Format: PDF-1.3
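
One observation on the sample string above, offered as a sketch rather than a confirmed diagnosis: every non-space character in the gibberish sits exactly 29 code points below the expected ASCII letter (',' is 44 and 'I' is 73; 'D' is 68 and 'a' is 97). That would be consistent with raw glyph codes being emitted as character codes when no Unicode mapping is available. The class and method names below are hypothetical and are not part of PDFBox; the offset of 29 is inferred from this one sample only and may not hold for other fonts or documents.

```java
public class ShiftDecoder {

    // Hypothetical helper, not a PDFBox API: adds a fixed offset to every
    // non-space character. Spaces are kept as-is because they already
    // appear as plain spaces in the extracted text.
    static String unshift(String garbled, int offset) {
        StringBuilder sb = new StringBuilder(garbled.length());
        for (int i = 0; i < garbled.length(); i++) {
            char c = garbled.charAt(i);
            sb.append(c == ' ' ? ' ' : (char) (c + offset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample taken from the comment above; 29 is an assumption
        // derived from that sample.
        String sample = ", DP D UHFHQW FROOHJH JUDGXDWH ZLWK";
        System.out.println(unshift(sample, 29));
        // prints "I am a recent college graduate with"
    }
}
```

The same offset lines up with the warnings: CID 3 plus 29 is 32 (space), CID 74 plus 29 is 103 ('g'), and CID 54 plus 29 is 83 ('S').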


> Text extraction failed on Korean PDF
> ------------------------------------
>
>                 Key: PDFBOX-2740
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2740
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>            Reporter: Julien Ortega
>         Attachments: g_KO_201506.pdf, g_KO_201506.txt
>
>
> Trying to extract text from a Korean PDF gives me a lot of warnings:
> WARNING: No Unicode mapping for US (33) in font 
> DVCAYA+WtKoBaeumMyungjoL063zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for NAK (33) in font 
> JYLDGG+WtKoBaeumMyungjoL053zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for RS (38) in font 
> WRYULE+WtKoBaeumMyungjoL013zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for DEL (33) in font 
> FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for SOH (33) in font 
> OOLNBG+WtKoBaeumGothicL0122b4?Pw
> and the result is not readable. The PDF must contain the necessary 
> conversion table, because every PDF reader (desktop or mobile) lets me copy 
> and paste the text without problems.
>  


