[jira] [Comment Edited] (PDFBOX-2740) Text extraction failed on Korean PDF

John Hewson (JIRA) Tue, 23 Feb 2016 11:20:48 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159473#comment-15159473
 ]


John Hewson edited comment on PDFBOX-2740 at 2/23/16 7:20 PM:
--------------------------------------------------------------

The difference Maruan noticed in Acrobat between copy & paste and "Save As" 
reminded me of an earlier issue where the same thing happened. In that case, 
the text extracted via copy and paste turned out to be the accessible text 
from the "marked content", rather than the text from the content stream.

Looking at this PDF, we see the same thing:

{code}
/Span << /ActualText (\376\377\263\304) >> BDC
  BT
    /F33 1 Tf
    8.4647 0 0 8.4647 392.0664 324.946 Tm
    (^) Tj
  ET
EMC
{code}
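
To make the mismatch concrete, here is a minimal sketch in plain Java (no PDFBox involved) that decodes the {{/ActualText}} literal shown above. The class and method names are hypothetical. It assumes the usual PDF text string convention that a leading {{\376\377}} is the UTF-16BE byte order mark, and it approximates PDFDocEncoding with ISO-8859-1 for strings without one. Backslash line continuations are not handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ActualTextDecoder {

    /** Turns the body of a PDF literal string (without the outer parentheses) into raw bytes. */
    static byte[] unescape(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i++);
            if (c != '\\' || i >= literal.length()) {
                out.write(c & 0xFF);
                continue;
            }
            char e = literal.charAt(i);
            if (e >= '0' && e <= '7') {
                // Octal escape: one to three octal digits form a single byte.
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < literal.length()
                        && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                i++;
                switch (e) {
                    case 'n': out.write('\n'); break;
                    case 'r': out.write('\r'); break;
                    case 't': out.write('\t'); break;
                    case 'b': out.write('\b'); break;
                    case 'f': out.write('\f'); break;
                    default: out.write(e & 0xFF); // covers \( \) and \\
                }
            }
        }
        return out.toByteArray();
    }

    /** Decodes a PDF text string: UTF-16BE if it starts with the FE FF marker. */
    static String decode(String literal) {
        byte[] bytes = unescape(literal);
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding in this sketch.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = decode("\\376\\377\\263\\304");
        // The single decoded character is the Hangul syllable U+B3C4.
        System.out.printf("U+%04X%n", (int) text.charAt(0));
    }
}
```

So the marked content carries one Hangul syllable (U+B3C4), while the {{Tj}} operator in the content stream only shows the single-byte code {{^}} in font {{/F33}}.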

We've seen cases in the past where marked content actually contains bad text 
while the content stream contains good text, so we don't extract text from 
marked content, and neither do most PDF viewers. We do provide 
PDFMarkedContentExtractor for those who want to extract marked content only.

[~chengas123], your problem may or may not be the same - open your problem PDF 
with PDFDebugger and search the Contents stream for {{/ActualText}}. If you see 
it, then you have the same problem. Otherwise, feel free to open a new issue 
if/when you have a PDF which you can post.
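
If you would rather check programmatically than by eye, the search can be sketched in plain Java. This is only an illustration: the class and method names are hypothetical, the input is assumed to be the already-decoded content stream text (as PDFDebugger displays it), and the regular expression only handles {{/ActualText}} values written as literal strings directly in the stream. Hex strings, nested unescaped parentheses, and property lists referenced by name from the page resources are not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ActualTextScan {

    // "/ActualText" followed by a literal string; escaped characters are allowed inside.
    private static final Pattern ACTUAL_TEXT =
            Pattern.compile("/ActualText\\s*\\(((?:\\\\.|[^\\\\()])*)\\)");

    /** Returns the raw (still escaped) body of every inline /ActualText literal. */
    static List<String> findActualText(String contentStream) {
        List<String> found = new ArrayList<>();
        Matcher m = ACTUAL_TEXT.matcher(contentStream);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String stream = "/Span << /ActualText (\\376\\377\\263\\304) >> BDC\n"
                + "BT /F33 1 Tf (^) Tj ET\nEMC";
        List<String> hits = findActualText(stream);
        System.out.println(hits.isEmpty()
                ? "no inline /ActualText found"
                : hits.size() + " inline /ActualText entry(ies) found");
    }
}
```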


> Text extraction failed on Korean PDF
> ------------------------------------
>
>                 Key: PDFBOX-2740
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2740
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>            Reporter: Julien Ortega
>         Attachments: g_KO_201506-ReaderDC-cutAndPaste.txt, 
> g_KO_201506-ReaderDC-saveAsText.txt, g_KO_201506.pdf, g_KO_201506.txt
>
>
> Trying to extract text from a Korean PDF gives me a lot of warnings:
> WARNING: No Unicode mapping for US (33) in font 
> DVCAYA+WtKoBaeumMyungjoL063zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for NAK (33) in font 
> JYLDGG+WtKoBaeumMyungjoL053zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for RS (38) in font 
> WRYULE+WtKoBaeumMyungjoL013zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for DEL (33) in font 
> FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont 
> toUnicode
> WARNING: No Unicode mapping for SOH (33) in font 
> OOLNBG+WtKoBaeumGothicL0122b4?Pw
> and the result is not readable. The PDF does contain the necessary 
> conversion table, because every PDF reader (desktop or mobile) lets me copy 
> and paste the text without problems.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
