[jira] [Updated] (PDFBOX-2547) maybe encoding error

Jiri Vorac (JIRA) Wed, 27 May 2015 00:37:40 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jiri Vorac updated PDFBOX-2547:
-------------------------------
    Attachment: PDFTextStreamEngine.txt

It seems the file contains no Unicode information for the characters 
mentioned.

Currently the font character code is converted directly to Unicode (in this 
case chr(2) for both characters mentioned), which loses information.

So I propose (patch attached) showing the font glyph name in the result instead 
of control character codes, wrapped in some context such as "<UNK_169lc>" so 
that it can be easily identified and perhaps replaced manually by the user.
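
The fallback proposed above could look roughly like the sketch below. It is not 
the attached patch and does not use any PDFBox API: the class GlyphFallback, the 
method toText and its two parameters are hypothetical names chosen for 
illustration, and the "?" used when no glyph name is known is an assumption.

```java
/**
 * Hypothetical helper (not part of PDFBox) illustrating the proposed fallback:
 * when a glyph has no usable Unicode mapping, emit a visible placeholder built
 * from the font's glyph name instead of a raw control character.
 */
final class GlyphFallback {

    private GlyphFallback() {
    }

    /**
     * @param unicode   text obtained for the glyph, or null if the font has none
     * @param glyphName glyph name reported by the font, or null if unknown
     * @return the Unicode text if it is usable, otherwise "<UNK_glyphName>"
     */
    static String toText(String unicode, String glyphName) {
        if (unicode != null && !unicode.isEmpty() && !hasControlChar(unicode)) {
            return unicode;
        }
        // Assumption: "?" stands in when the font reports no glyph name at all.
        String name = (glyphName == null || glyphName.isEmpty()) ? "?" : glyphName;
        return "<UNK_" + name + ">";
    }

    /** True if the text contains a control character other than whitespace. */
    private static boolean hasControlChar(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.isISOControl(cp) && !Character.isWhitespace(cp));
    }
}
```

With this sketch, GlyphFallback.toText("\u0002", "169lc") yields "<UNK_169lc>", 
while text that already has a proper Unicode value is passed through unchanged.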

> maybe encoding error
> --------------------
>
>                 Key: PDFBOX-2547
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2547
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 2.0.0
>            Reporter: Michał
>            Priority: Minor
>         Attachments: PDFTextStreamEngine.txt
>
>
> Hi,
> I just downloaded a PDF from this page:
> http://download.jw.org/files/media_books/32/es15_P.pdf
> and want to extract the text from this document.
> I use the command:
> java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf resultFile-UTF-8.txt
> But I see some problems, for example:
> 1. In the text file I see 'STX' and 'ETX' instead of 'ę' and 'ą'.
> 2. The extractor returns the text 'naprzykładmiłe' instead of 'na przykład miłe' 
> (page 4, line 6).
> Maybe these are just small problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

