[
https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561977#comment-14561977
]
John Hewson commented on PDFBOX-2547:
-------------------------------------
{quote}
So I would propose (patch attached) showing the font glyph name instead of
control character codes in the result, with some context such as "<UNK_169lc>"
so that it can be easily identified and perhaps replaced manually by the user.
{quote}
This PDF really does include those control characters as its text. Replacing
them with anything else is going to be worse still, especially as glyph names
are NOT part of the text of a PDF.
If you want a custom glyph-to-text mapping, you can override the
showGlyph(...) method of PDFTextStripper. Then, if the "unicode" parameter
is null, you can perform your own mapping (using the "code" and the PDFont)
and pass the resulting unicode value to super.showGlyph(...).
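As a rough illustration of the "perform your own mapping" step, the lookup could be kept in a small helper like the one below. This is only a sketch: GlyphFallback, its methods, the font name "ExampleFont" and the table entries are all hypothetical and not part of PDFBox, and the real character codes used by the fonts in this PDF would have to be found by inspecting the file.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of PDFBox): a custom code-to-text
 * table for glyphs that arrive without a Unicode value.
 */
final class GlyphFallback {
    // Key is "<fontName>:<code>"; all entries are illustrative assumptions.
    private final Map<String, String> table = new HashMap<>();

    /** Registers the text to use for one character code of one font. */
    void put(String fontName, int code, String text) {
        table.put(fontName + ":" + code, text);
    }

    /**
     * Returns the text for a glyph. A non-null unicode value is kept
     * unchanged; otherwise the table is consulted, and null is returned
     * when no entry exists for this font and code.
     */
    String resolve(String unicode, String fontName, int code) {
        if (unicode != null) {
            return unicode;
        }
        return table.get(fontName + ":" + code);
    }

    public static void main(String[] args) {
        GlyphFallback fallback = new GlyphFallback();
        // Assumed entries: codes 2 and 3 standing in for the two letters
        // mentioned in the report (written as escapes to stay ASCII-safe).
        fallback.put("ExampleFont", 2, "\u0119");
        fallback.put("ExampleFont", 3, "\u0105");

        System.out.println(fallback.resolve(null, "ExampleFont", 2));
        System.out.println(fallback.resolve("A", "ExampleFont", 2));
        System.out.println(fallback.resolve(null, "ExampleFont", 99));
    }
}
```

Inside an overridden showGlyph(...), the call would then be something like resolve(unicode, font.getName(), code), with the result passed on to super.showGlyph(...) in place of the original "unicode" argument. Returning null for unknown codes leaves the default behaviour untouched.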
> maybe encoding error
> --------------------
>
> Key: PDFBOX-2547
> URL: https://issues.apache.org/jira/browse/PDFBOX-2547
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7, 2.0.0
> Reporter: Michał
> Priority: Minor
> Attachments: PDFTextStreamEngine.txt
>
>
> Hi,
> I just downloaded a PDF from this page:
> http://download.jw.org/files/media_books/32/es15_P.pdf
> and want to extract the text from this document.
> I use the command:
> java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf resultFile-UTF-8.txt
> But I see some problems, for example:
> 1. The text file contains 'STX' and 'ETX' instead of 'ę' and 'ą'.
> 2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe'
> (page 4, line 6).
> Maybe these are just small problems.