[jira] [Commented] (PDFBOX-2547) maybe encoding error

John Hewson (JIRA) Wed, 27 May 2015 16:25:15 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561977#comment-14561977
 ]


John Hewson commented on PDFBOX-2547:
-------------------------------------

{quote}
So I would propose (patch attached) showing the font glyph name instead of the 
control character codes in the result, with some context such as "<UNK_169lc>" 
so that it can be easily identified and perhaps replaced manually by the user.
{quote}

This PDF really does include those control characters as its text. Replacing 
them with anything else is going to be worse still, especially as glyph names 
are NOT part of the text of a PDF.

What you can do if you want a custom glyph-to-text mapping is to override 
the showGlyph(...) method of PDFTextStripper. Then, if the "unicode" parameter 
is null, you can perform your own mapping (using the "code" and PDFont) and 
then pass that unicode value to super.showGlyph(...).
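
The override pattern described above can be sketched roughly as follows. To 
keep the example runnable without PDFBox on the classpath, it uses stand-in 
classes: SketchStripper, CustomMappingStripper and the two-argument 
showGlyph(code, unicode) are hypothetical simplifications, not PDFBox's API. 
The real PDFTextStripper method takes more parameters (including the PDFont) 
and its signature depends on the PDFBox version. The table entries are also 
invented purely for illustration; a real mapping would likely be keyed on the 
font as well as the code.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the text-extraction base class; NOT PDFBox's real API.
class SketchStripper {
    protected final StringBuilder out = new StringBuilder();

    // Called once per glyph; `unicode` is null when no mapping is known.
    protected void showGlyph(int code, String unicode) {
        if (unicode != null) {
            out.append(unicode);
        }
    }

    String text() {
        return out.toString();
    }
}

// Subclass that supplies its own mapping only when the base one is missing.
class CustomMappingStripper extends SketchStripper {
    private final Map<Integer, String> fallback = new HashMap<>();

    CustomMappingStripper() {
        // Hypothetical code-to-text entries, chosen only for illustration.
        fallback.put(0x02, "\u0119"); // LATIN SMALL LETTER E WITH OGONEK
        fallback.put(0x03, "\u0105"); // LATIN SMALL LETTER A WITH OGONEK
    }

    @Override
    protected void showGlyph(int code, String unicode) {
        if (unicode == null) {
            // No Unicode value supplied: try the custom table.
            // This may still yield null, in which case nothing is appended.
            unicode = fallback.get(code);
        }
        super.showGlyph(code, unicode);
    }
}

class GlyphMappingSketch {
    public static void main(String[] args) {
        CustomMappingStripper stripper = new CustomMappingStripper();
        stripper.showGlyph(0x6D, "m");  // already mapped, passed through as-is
        stripper.showGlyph(0x02, null); // unmapped, resolved by the fallback
        System.out.println(stripper.text());
    }
}
```

Glyphs that already carry a Unicode value are passed through untouched, so the 
custom table only affects the cases the extractor could not resolve itself.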

> maybe encoding error
> --------------------
>
>                 Key: PDFBOX-2547
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2547
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 2.0.0
>            Reporter: Michał
>            Priority: Minor
>         Attachments: PDFTextStreamEngine.txt
>
>
> Hi,
> I just downloaded a PDF from:
> http://download.jw.org/files/media_books/32/es15_P.pdf
> and want to extract the text from this document.
> I use the command:
> java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf 
> resultFile-UTF-8.txt
> But I see some problems, for example:
> 1. I see 'STX' and 'ETX' in the text file instead of 'ę' and 'ą'.
> 2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' 
> (page 4, line 6).
> Maybe these are just small problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

