[
https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jiri Vorac updated PDFBOX-2547:
-------------------------------
Attachment: PDFTextStreamEngine.txt
It seems the file contains no Unicode information for the mentioned
characters.
Currently the font character code is converted directly to Unicode (in this
case chr(2) for both mentioned characters), which loses information.
So I propose (patch attached) showing the font glyph name in the result instead
of the control character code, wrapped in some context such as "<UNK_169lc>" so
that it can be easily identified and perhaps replaced manually by the user.
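
The idea can be sketched roughly as follows. This is not the attached patch and
does not use any PDFBox API: the class GlyphFallback and its methods are
hypothetical names chosen for illustration, and the glyph name "169lc" is taken
from the placeholder example above. It assumes the caller already has the
decoded text (possibly just a control character) and the glyph name from the
font.

```java
// Hypothetical helper, not part of PDFBox: illustrates the proposed fallback only.
final class GlyphFallback {
    private GlyphFallback() {}

    /**
     * Returns true if the decoded text is missing, or consists only of
     * non-whitespace control characters such as U+0002 (STX) or U+0003 (ETX).
     */
    static boolean isUnusable(String unicode) {
        if (unicode == null || unicode.isEmpty()) {
            return true;
        }
        return unicode.codePoints()
                .allMatch(cp -> Character.isISOControl(cp) && !Character.isWhitespace(cp));
    }

    /**
     * Keeps usable text as-is; otherwise emits a marker built from the glyph
     * name, e.g. "<UNK_169lc>", so the user can find and replace it later.
     */
    static String render(String unicode, String glyphName) {
        if (!isUnusable(unicode)) {
            return unicode;
        }
        if (glyphName == null || glyphName.isEmpty()) {
            // No glyph name to fall back on: leave the output unchanged.
            return unicode == null ? "" : unicode;
        }
        return "<UNK_" + glyphName + ">";
    }
}
```

With this sketch, GlyphFallback.render("\u0002", "169lc") yields "<UNK_169lc>",
while text that already decodes to a printable character passes through
unchanged.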
> maybe encoding error
> --------------------
>
> Key: PDFBOX-2547
> URL: https://issues.apache.org/jira/browse/PDFBOX-2547
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7, 2.0.0
> Reporter: Michał
> Priority: Minor
> Attachments: PDFTextStreamEngine.txt
>
>
> Hi,
> I just downloaded a PDF from this page:
> http://download.jw.org/files/media_books/32/es15_P.pdf
> and want to extract the text from this document.
> I use the command:
> java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf
> resultFile-UTF-8.txt
> But I see some problems, for example:
> 1. In the text file I see 'STX' and 'ETX' instead of 'ę' and 'ą'.
> 2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe'
> (page 4, line 6).
> Maybe these are just small problems.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)