[jira] [Commented] (PDFBOX-2547) maybe encoding error

John Hewson (JIRA) Wed, 27 May 2015 16:28:26 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561983#comment-14561983
 ]


John Hewson commented on PDFBOX-2547:
-------------------------------------

Also, if you have known glyph names, e.g. UNK_169lc, and you know what the 
Unicode mapping for each is supposed to be, then you can simply add those 
mappings to the glyph list used by PDTextStripper, which can be found in 
additional.txt.
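
For illustration, here is a minimal sketch of what such mappings look like and 
how they are read. It assumes additional.txt follows the Adobe Glyph List 
layout (one "name;HEX" entry per line, "#" for comments), and it does not call 
PDFBox at all. The class GlyphListSketch, the glyph names UNK_example_a and 
UNK_example_e, and their code points are hypothetical placeholders, not the 
real mappings for this PDF.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: parses glyph-list style entries.
class GlyphListSketch {

    // Reads lines of the assumed form "glyphName;XXXX[ XXXX ...]" where each
    // XXXX is a hexadecimal Unicode code point. Blank lines and lines
    // starting with '#' are skipped.
    static Map<String, String> parse(Reader source) throws IOException {
        Map<String, String> nameToUnicode = new HashMap<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split(";");
            if (parts.length < 2) {
                continue; // not a "name;codes" entry
            }
            StringBuilder text = new StringBuilder();
            for (String hex : parts[1].trim().split("\\s+")) {
                text.appendCodePoint(Integer.parseInt(hex, 16));
            }
            nameToUnicode.put(parts[0].trim(), text.toString());
        }
        return nameToUnicode;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder entries; replace with the real glyph names and the
        // code points you know they should map to.
        String extraEntries =
                "# hypothetical additions\n"
              + "UNK_example_a;0105\n"
              + "UNK_example_e;0119\n";
        Map<String, String> mappings = parse(new StringReader(extraEntries));
        System.out.println(mappings.get("UNK_example_e"));
    }
}
```

Entries in this shape would be appended to additional.txt; the parser above 
only serves to check that each line is well formed before you rebuild.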

> maybe encoding error
> --------------------
>
>                 Key: PDFBOX-2547
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2547
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 2.0.0
>            Reporter: Michał
>            Priority: Minor
>         Attachments: PDFTextStreamEngine.txt
>
>
> Hi,
> I just downloaded a PDF from this page:
> http://download.jw.org/files/media_books/32/es15_P.pdf
> and want to extract the text from this document.
> I use the command:
> java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf 
> resultFile-UTF-8.txt
> But I see some problems, for example:
> 1. In the text file I see 'STX' and 'ETX' instead of 'ę' and 'ą'.
> 2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' 
> (page 4, line 6).
> Maybe these are just small problems.



