[ https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561983#comment-14561983 ]
John Hewson commented on PDFBOX-2547:
-------------------------------------
Also, if you have known glyph names, e.g. UNK_169lc, and you know what the
Unicode mapping for each of them is supposed to be, then you can simply add
those mappings to the glyph list used by PDTextStripper, which can be found
in additional.txt.
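
To illustrate what such a mapping amounts to, here is a minimal sketch. It
assumes that entries use the Adobe glyph list layout, i.e. a glyph name, a
semicolon, and one or more hexadecimal code points, with '#' starting a
comment line. The class GlyphListSketch, the name UNK_hypothetical and the
code points paired with the names below are made up for illustration only
(U+0119 is 'ę', U+0105 is 'ą'); the correct values have to be determined
from the actual font. This is not PDFBox's own loader.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: parses glyph-list style text
// ("name;HEX [HEX ...]") into a map from glyph name to Unicode string.
class GlyphListSketch {

    public static Map<String, String> parse(String text) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            line = line.trim();
            // Skip blank lines and '#' comment lines.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split(";");
            if (parts.length != 2) {
                continue; // not a "name;codepoints" entry
            }
            // One glyph name may map to a sequence of code points.
            StringBuilder unicode = new StringBuilder();
            for (String hex : parts[1].trim().split("\\s+")) {
                unicode.appendCodePoint(Integer.parseInt(hex, 16));
            }
            mapping.put(parts[0].trim(), unicode.toString());
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Assumed entries: the pairing of these names with these code
        // points is illustrative, not taken from the real document.
        String entries = "# illustrative entries only\n"
                + "UNK_169lc;0119\n"
                + "UNK_hypothetical;0105\n";
        for (Map.Entry<String, String> e : parse(entries).entrySet()) {
            // Print the code point in hex to avoid console encoding issues.
            System.out.printf("%s -> U+%04X%n",
                    e.getKey(), e.getValue().codePointAt(0));
        }
    }
}
```

With entries of this kind in place, a glyph named UNK_169lc would resolve to
the mapped character during extraction rather than to an unknown value.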
> maybe encoding error
> --------------------
>
> Key: PDFBOX-2547
> URL: https://issues.apache.org/jira/browse/PDFBOX-2547
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7, 2.0.0
> Reporter: Michał
> Priority: Minor
> Attachments: PDFTextStreamEngine.txt
>
>
> Hi,
> I just downloaded a PDF from:
> http://download.jw.org/files/media_books/32/es15_P.pdf
> and want to extract the text from this document.
> I use the command:
> java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf resultFile-UTF-8.txt
> But I see some problems, for example:
> 1. In the text file I see 'STX' and 'ETX' instead of 'ę' and 'ą'.
> 2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' (page 4, line 6).
> Maybe these are just small problems.