[jira] [Commented] (PDFBOX-2547) maybe encoding error

John Hewson (JIRA) Mon, 08 Dec 2014 12:13:28 -0800

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238407#comment-14238407
 ]


John Hewson commented on PDFBOX-2547:
-------------------------------------

Text extraction of this PDF does not produce good results with Acrobat
either, although the problems are not as bad as with PDFBox. Acrobat extracts
nothing for 'ę' and 'ą', but 'na przykład miłe' is extracted correctly.

Calling setSpacingTolerance(0.3) on PDFTextStripper seems to produce better 
results.
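
As a rough illustration of why a lower tolerance could help with the
'naprzykładmiłe' case, here is a minimal Java sketch of gap-based word
splitting. It is a simplified model, not PDFBox's actual implementation: the
class SpacingToleranceSketch, the Glyph type, the join method, and all
coordinates are hypothetical. It assumes a space is emitted whenever the
horizontal gap between two neighbouring glyphs exceeds the nominal space width
multiplied by the tolerance. The value 0.5 is used only for comparison.

```java
public class SpacingToleranceSketch {

    /** Hypothetical glyph run with its horizontal start and end position. */
    public static final class Glyph {
        final String text;
        final float startX;
        final float endX;

        public Glyph(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Joins glyph runs, inserting a space wherever the gap to the previous
     * run is larger than spaceWidth * spacingTolerance (assumed rule).
     */
    public static String join(Glyph[] glyphs, float spaceWidth, float spacingTolerance) {
        float threshold = spaceWidth * spacingTolerance;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length; i++) {
            if (i > 0 && glyphs[i].startX - glyphs[i - 1].endX > threshold) {
                out.append(' ');
            }
            out.append(glyphs[i].text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up positions: the two runs are separated by a gap of 1.0,
        // and the nominal space width is 2.5. The l-with-stroke is written
        // as a Unicode escape so the source file stays plain ASCII.
        Glyph[] glyphs = {
            new Glyph("na", 0f, 10f),
            new Glyph("przyk\u0142ad", 11f, 50f),
        };
        // Threshold 2.5 * 0.5 = 1.25: the gap is too small, no space is inserted.
        System.out.println(join(glyphs, 2.5f, 0.5f));
        // Threshold 2.5 * 0.3 = 0.75: the gap now counts as a word break.
        System.out.println(join(glyphs, 2.5f, 0.3f));
    }
}
```

In this model, lowering the tolerance makes the splitter treat narrower gaps
as word breaks, which matches the direction of the improvement seen with
setSpacingTolerance(0.3). A value that is too low would be expected to start
splitting inside words instead.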

> maybe encoding error
> --------------------
>
>                 Key: PDFBOX-2547
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2547
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>            Reporter: Michał
>            Priority: Minor
>
> Hi,
> I just downloaded a PDF from this page:
> http://download.jw.org/files/media_books/32/es15_P.pdf
> and want to extract text from this document.
> I use the command:
> java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf 
> resultFile-UTF-8.txt
> But I see some problems, for example:
> 1. The text file contains 'STX' and 'ETX' instead of 'ę' and 'ą'.
> 2. The extractor returns the text 'naprzykładmiłe' instead of
> 'na przykład miłe' (page 4, line 6).
> Maybe these are only small problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
