[
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960232#comment-14960232
]
Maruan Sahyoun commented on PDFBOX-2740:
----------------------------------------
It would help to have the actual document, not only the error messages, to
tell whether this is an error or correct behavior. For text extraction, the
glyph information needs to be translated to a Unicode character. That's why a
character might be visible on screen while the extracted text either lacks it
or contains a different character than expected.
{quote}
When extracting character content, a consumer application can easily convert
text to Unicode values if a font’s characters are identified according to a
standard character set that is known to the application. This character
identification can occur if either the font uses a standard named encoding or
the characters in the font are identified by standard character names or CIDs
in a well-known collection. Section 5.9.1, “Mapping Character Codes to
Unicode Values,” describes in detail the overall algorithm for mapping
character codes to Unicode values.
If a font is not defined in one of these ways, the glyphs can still be shown,
but the characters cannot be converted to Unicode values without additional
information:
• This information can be provided as an optional ToUnicode entry in the font
dictionary (PDF 1.2; see Section 5.9.2, “ToUnicode CMaps”), whose value is a
stream object containing a special kind of CMap file that maps character codes
to Unicode values.
• An ActualText entry for a structure element or marked-content sequence (see
Section 10.8.3, “Replacement Text”) can be used to specify the text content
directly.
{quote}
The error message indicates that the glyph can't be translated to a Unicode
code point. Without the original document it's impossible to tell whether it
should have found or calculated a mapping (indicating a potential bug) or not.
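
To illustrate the lookup order described in the quoted section, here is a
minimal sketch. It is not PDFBox's actual implementation: the class
UnicodeLookupSketch, its methods, and the three-entry glyph list are all
hypothetical stand-ins. It assumes a font that may carry ToUnicode entries
and an encoding that maps codes to glyph names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative sketch only; not PDFBox code. All names are hypothetical. */
final class UnicodeLookupSketch {

    // Stand-in for a parsed ToUnicode CMap: character code -> Unicode text.
    private final Map<Integer, String> toUnicode = new HashMap<>();

    // Stand-in for the font's encoding: character code -> glyph name.
    private final Map<Integer, String> codeToGlyphName = new HashMap<>();

    // Tiny stand-in for a standard glyph list: glyph name -> Unicode text.
    private static final Map<String, String> GLYPH_LIST = new HashMap<>();
    static {
        GLYPH_LIST.put("A", "A");
        GLYPH_LIST.put("exclam", "!");
        GLYPH_LIST.put("space", " ");
    }

    void addToUnicode(int code, String text) {
        toUnicode.put(code, text);
    }

    void addGlyphName(int code, String name) {
        codeToGlyphName.put(code, name);
    }

    /** Returns the Unicode text for a code, or empty if no mapping exists. */
    Optional<String> lookup(int code) {
        // 1. An explicit ToUnicode entry takes priority.
        String mapped = toUnicode.get(code);
        if (mapped != null) {
            return Optional.of(mapped);
        }
        // 2. Otherwise try the glyph name against the standard glyph list.
        String name = codeToGlyphName.get(code);
        if (name != null) {
            return Optional.ofNullable(GLYPH_LIST.get(name));
        }
        // 3. Nothing known: the glyph can be drawn but not extracted as text.
        return Optional.empty();
    }

    public static void main(String[] args) {
        UnicodeLookupSketch font = new UnicodeLookupSketch();
        font.addToUnicode(33, "\uAC00"); // explicit ToUnicode entry
        font.addGlyphName(65, "A");      // standard glyph name
        font.addGlyphName(38, "RS");     // name that is not in the glyph list
        for (int code : new int[] {33, 65, 38, 99}) {
            String result = font.lookup(code)
                    .map(s -> "U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase())
                    .orElse("no Unicode mapping");
            System.out.println(code + " -> " + result);
        }
    }
}
```

In this sketch, a code with neither a ToUnicode entry nor a recognizable glyph
name ends up in the "no Unicode mapping" case, which is the situation the
warnings in the report describe.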
> Text extraction failed on Korean PDF
> ------------------------------------
>
> Key: PDFBOX-2740
> URL: https://issues.apache.org/jira/browse/PDFBOX-2740
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
> Reporter: Julien Ortega
> Attachments: g_KO_201506.pdf, g_KO_201506.txt
>
>
> Trying to extract text on a Korean PDF gives me a lot of warnings :
> WARNING: No Unicode mapping for US (33) in font
> DVCAYA+WtKoBaeumMyungjoL063zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont
> toUnicode
> WARNING: No Unicode mapping for NAK (33) in font
> JYLDGG+WtKoBaeumMyungjoL053zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont
> toUnicode
> WARNING: No Unicode mapping for RS (38) in font
> WRYULE+WtKoBaeumMyungjoL013zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont
> toUnicode
> WARNING: No Unicode mapping for DEL (33) in font
> FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont
> toUnicode
> WARNING: No Unicode mapping for SOH (33) in font
> OOLNBG+WtKoBaeumGothicL0122b4?Pw
> and the result is not readable. The PDF must contain the necessary
> conversion table, because every PDF reader (desktop or mobile) lets me copy
> and paste the text without problems.
>