[jira] [Closed] (PDFBOX-1244) the text content extracted by PDFBOX is not as the same as it is displayed in Adobe reader

JIRA Thu, 23 Oct 2014 10:12:07 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andreas Lehmkühler closed PDFBOX-1244.
--------------------------------------
    Resolution: Not a Problem
      Assignee: Andreas Lehmkühler

PDFBox extracts exactly the same text as Acrobat Reader does. And yes, it is 
not the displayed text, which suggests that the ToUnicode mapping of the PDF 
is broken. 

Closed as "Not a problem"
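
To illustrate the diagnosis above, here is a minimal, self-contained sketch 
of how a wrong ToUnicode entry can make the extracted text differ from the 
rendered text. It does not use the PDFBox API and does not read the attached 
file. The class name, the character codes 0x01 and 0x02, and both lookup 
tables are hypothetical; only the two strings "年期" (displayed) and "年预" 
(extracted) come from the report.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSketch {
    // Hypothetical character codes as they might appear in a content stream.
    static final int CODE_A = 0x01;
    static final int CODE_B = 0x02;

    // Assumed glyph table: what the embedded font draws for each code.
    // A viewer renders from this, so it determines the displayed text.
    static final Map<Integer, String> GLYPHS = new HashMap<>();

    // Assumed ToUnicode table: what the PDF claims each code means.
    // Text extraction and copy/paste read from this table.
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();

    static {
        GLYPHS.put(CODE_A, "\u5E74");      // U+5E74
        GLYPHS.put(CODE_B, "\u671F");      // U+671F
        TO_UNICODE.put(CODE_A, "\u5E74");  // U+5E74, matches the glyph
        TO_UNICODE.put(CODE_B, "\u9884");  // U+9884, does NOT match the glyph
    }

    // Look up every code in the given table; unknown codes become U+FFFD.
    static String decode(int[] codes, Map<Integer, String> table) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(table.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {CODE_A, CODE_B};
        String displayed = decode(codes, GLYPHS);
        String extracted = decode(codes, TO_UNICODE);
        System.out.println("displayed: " + displayed);
        System.out.println("extracted: " + extracted);
        System.out.println("match:     " + displayed.equals(extracted));
    }
}
```

In this model an extractor can only return what the ToUnicode table says, 
so a wrong entry in the file cannot be corrected by the extracting tool 
without additional information such as a hand-made replacement mapping or 
OCR of the rendered page.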



> the text content extracted by PDFBOX is not as the same as it is displayed in 
> Adobe reader
> ------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1244
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1244
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0, 1.7.0
>         Environment: windows xp, Eclipse 3.2.0
>            Reporter: huangchangan
>            Assignee: Andreas Lehmkühler
>         Attachments: P020101210619863754780 214.pdf
>
>
> Hello, 
> I used PDFBox to extract the text content from the attached PDF document 
> and found that the extracted text is "年预", but the text displayed in Adobe 
> Reader is "年期". I would like to know how to get the correct text content 
> (as shown in Adobe Reader) from this kind of PDF document with PDFBox.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
