[ https://issues.apache.org/jira/browse/PDFBOX-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238407#comment-14238407 ]
John Hewson commented on PDFBOX-2547:
-------------------------------------
Text extraction of this PDF does not produce good results with Acrobat
either, although the problems are not as bad as with PDFBox. Acrobat extracts
nothing for 'ę' and 'ą', but 'na przykład miłe' is extracted correctly.
Calling setSpacingTolerance(0.3) on PDFTextStripper seems to produce better
results.
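
To illustrate why a lower tolerance can help with the missing spaces, here is
a minimal sketch of gap-based word-break detection. It is a simplified model,
not PDFBox's actual implementation: the class SpacingToleranceSketch, the Run
type, the coordinates and the space width of 2.5 are all made up for
illustration, and 0.5 is assumed to be the default tolerance. The real
stripper uses additional heuristics, such as the average character width. In
PDFBox itself the call would be stripper.setSpacingTolerance(0.3f) on a
PDFTextStripper instance before extracting the text (the argument is a float).

```java
import java.util.Arrays;
import java.util.List;

public class SpacingToleranceSketch {

    /** A run of text with its horizontal extent (hypothetical units). */
    public static final class Run {
        final String text;
        final float startX;
        final float endX;

        Run(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Joins runs left to right, inserting a space whenever the gap to the
     * previous run exceeds spaceWidth * spacingTolerance.
     */
    public static String join(List<Run> runs, float spaceWidth, float spacingTolerance) {
        float threshold = spaceWidth * spacingTolerance;
        StringBuilder out = new StringBuilder();
        Run previous = null;
        for (Run run : runs) {
            if (previous != null && run.startX - previous.endX > threshold) {
                out.append(' ');
            }
            out.append(run.text);
            previous = run;
        }
        return out.toString();
    }

    /** Made-up positions: the words are separated by gaps of only 1.0 unit. */
    public static List<Run> sampleLine() {
        return Arrays.asList(
            new Run("na", 0f, 10f),
            new Run("przykład", 11f, 50f),
            new Run("miłe", 51f, 70f));
    }

    public static void main(String[] args) {
        // Threshold 1.25: the 1.0 gaps are too narrow, so no spaces are inserted.
        System.out.println(join(sampleLine(), 2.5f, 0.5f));
        // Threshold about 0.75: the same gaps now count as word breaks.
        System.out.println(join(sampleLine(), 2.5f, 0.3f));
    }
}
```

In this model the narrow gaps fall below the threshold at 0.5 but above it at
0.3, which matches the 'naprzykładmiłe' symptom. A tolerance that is too low
could also split words that are merely loosely kerned, so the value probably
needs tuning per document.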
> maybe encoding error
> --------------------
>
> Key: PDFBOX-2547
> URL: https://issues.apache.org/jira/browse/PDFBOX-2547
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Reporter: Michał
> Priority: Minor
>
> Hi,
> I just downloaded a PDF from this page:
> http://download.jw.org/files/media_books/32/es15_P.pdf
> and want to extract text from this document.
> I use the command:
> java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 es15_P.pdf resultFile-UTF-8.txt
> But I see some problems, for example:
> 1. The text file contains 'STX' and 'ETX' instead of 'ę' and 'ą'.
> 2. The extractor returns the text 'naprzykładmiłe' instead of
> 'na przykład miłe' (page 4, line 6).
> Maybe these are just small problems.