[jira] [Commented] (PDFBOX-2740) Text extraction failed on Korean PDF

Maruan Sahyoun (JIRA) Thu, 15 Oct 2015 23:16:52 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960232#comment-14960232
 ]


Maruan Sahyoun commented on PDFBOX-2740:
----------------------------------------

It would be good to actually have the document, and not only the error messages, 
to be able to tell whether this is an error or correct behavior. For text extraction, 
each glyph has to be translated to a Unicode character. That's why 
a character can be visible on screen while the extracted text 
either lacks it or contains a different character than expected.

{quote}
When extracting character content, a consumer application can easily convert 
text to Unicode values if a font’s characters are identified according to a 
standard character set that is known to the application. This character 
identification can occur if either the font uses a standard named encoding or 
the characters in the font are identified by standard character names or CIDs 
in a well-known collection. Section 5.9.1, “Mapping Character Codes to 
Unicode Values,” describes in detail the overall algorithm for mapping 
character codes to Unicode values.
If a font is not defined in one of these ways, the glyphs can still be shown, 
but the characters cannot be converted to Unicode values without additional 
information:
• This information can be provided as an optional ToUnicode entry in the font 
dictionary (PDF 1.2; see Section 5.9.2, “ToUnicode CMaps”), whose value is a 
stream object containing a special kind of CMap file that maps character codes 
to Unicode values.
• An ActualText entry for a structure element or marked-content sequence (see 
Section 10.8.3, “Replacement Text”) can be used to specify the text content 
directly.
{quote}
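
To illustrate the lookup described in the quoted passage, here is a minimal sketch in plain Java. It is not PDFBox code: the class `UnicodeLookupSketch`, its methods, and the two maps standing in for a ToUnicode CMap and a standard encoding are all hypothetical. The order in which the two sources are consulted (ToUnicode first, then the standard encoding) is an assumption of this sketch, and ActualText is left out. A code found in neither source is collected as unmapped, which roughly corresponds to the situation behind a "No Unicode mapping" warning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical model for illustration only; none of these names exist in PDFBox.
final class UnicodeLookupSketch {

    // Resolve one character code to Unicode text, or empty if no source knows it.
    static Optional<String> toUnicode(int code,
                                      Map<Integer, String> toUnicodeCMap,
                                      Map<Integer, String> standardEncoding) {
        // Assumed first source: the font's optional ToUnicode mapping.
        String mapped = toUnicodeCMap.get(code);
        if (mapped != null) {
            return Optional.of(mapped);
        }
        // Assumed second source: a standard encoding known to the application.
        mapped = standardEncoding.get(code);
        if (mapped != null) {
            return Optional.of(mapped);
        }
        // The glyph may still be drawable, but it cannot be converted to text.
        return Optional.empty();
    }

    // Extract text for a run of codes; codes without a mapping are skipped
    // in the output and recorded in the 'unmapped' list.
    static String extract(int[] codes,
                          Map<Integer, String> toUnicodeCMap,
                          Map<Integer, String> standardEncoding,
                          List<Integer> unmapped) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            Optional<String> unicode = toUnicode(code, toUnicodeCMap, standardEncoding);
            if (unicode.isPresent()) {
                text.append(unicode.get());
            } else {
                unmapped.add(code);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> toUnicodeCMap = new HashMap<>();
        toUnicodeCMap.put(72, "H");
        Map<Integer, String> standardEncoding = new HashMap<>();
        standardEncoding.put(105, "i");

        List<Integer> unmapped = new ArrayList<>();
        String text = extract(new int[] {72, 105, 33}, toUnicodeCMap, standardEncoding, unmapped);
        System.out.println(text + " / unmapped codes: " + unmapped);
    }
}
```

In this toy run code 33 has no entry in either map, so it is shown on screen in a real viewer but dropped from the extracted text here.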

The error message indicates that the glyph can't be translated to a Unicode 
code. Without the original document it's impossible to tell whether PDFBox should 
have found or calculated a mapping (which would indicate a potential bug) or not.

> Text extraction failed on Korean PDF
> ------------------------------------
>
>                 Key: PDFBOX-2740
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2740
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>            Reporter: Julien Ortega
>         Attachments: g_KO_201506.pdf, g_KO_201506.txt
>
>
> Trying to extract text from a Korean PDF gives me a lot of warnings:
> WARNING: No Unicode mapping for US (33) in font 
> DVCAYA+WtKoBaeumMyungjoL063zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for NAK (33) in font 
> JYLDGG+WtKoBaeumMyungjoL053zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for RS (38) in font 
> WRYULE+WtKoBaeumMyungjoL013zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for DEL (33) in font 
> FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for SOH (33) in font 
> OOLNBG+WtKoBaeumGothicL0122b4?Pw
> and the result is not readable. The PDF contains the necessary 
> conversion table, because every PDF reader (desktop or mobile) lets me copy and 
> paste the text without problems.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

