[
https://issues.apache.org/jira/browse/PDFBOX-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joel Hirsh updated PDFBOX-2463:
-------------------------------
Attachment: mangled_text .pdf
Snippet that shows problem
> ExtractTextByArea mangling second half of this string - transposed, skipped,
> etc
> --------------------------------------------------------------------------------
>
> Key: PDFBOX-2463
> URL: https://issues.apache.org/jira/browse/PDFBOX-2463
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.7
> Reporter: Joel Hirsh
> Attachments: mangled_text .pdf
>
>
> The text of this PDF snippet is being completely mangled by ExtractTextByArea.
> I have a large PDF file where this happens on every line.
> Visually (and in Acrobat) the text reads:
> 12 Jun EP COPY WORKS LIMITED 503646200256 5637 3.70 11,252.49 OD
> However ExtractTextByArea comes up with:
> 12 Jun EP COPY WORKS LIMITED 503646200256 35 .6 70
> 11,
> 3 257 2.49
> OD
> So the first half of the string is OK, but starting at '5637' characters are
> skipped and other characters are inserted; the rest is completely mangled.
> FWIW, I dumped the COSStrings in PDFStreamEngine and the strings all appear
> correct, with nothing unusual.
> Test file to be attached.
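
To pin down where the extracted line stops matching what Acrobat shows, a small comparison helper can be useful when checking many lines of a large file. The following is only a sketch using plain Java: the class name `ExtractionDiff` and its methods are hypothetical and not part of PDFBox, and the two strings are copied from the description above (the line breaks in the extracted output are collapsed to single spaces before comparing).

```java
// Hypothetical helper, not part of PDFBox: locates the first character at
// which extracted text stops matching the expected (visually correct) text.
class ExtractionDiff {

    // Collapse runs of whitespace (including line breaks) to single spaces.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Returns the index of the first differing character, or -1 if equal.
    static int firstDivergence(String expected, String actual) {
        int n = Math.min(expected.length(), actual.length());
        for (int i = 0; i < n; i++) {
            if (expected.charAt(i) != actual.charAt(i)) {
                return i;
            }
        }
        // One string is a prefix of the other, or they are identical.
        return expected.length() == actual.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // Text as shown visually and in Acrobat (from the report).
        String expected =
            "12 Jun EP COPY WORKS LIMITED 503646200256 5637 3.70 11,252.49 OD";
        // Text as reported from ExtractTextByArea (from the report).
        String actual = normalize(
            "12 Jun EP COPY WORKS LIMITED 503646200256 35 .6 70\n"
            + "11,\n"
            + "3 257 2.49\n"
            + "OD");

        int idx = firstDivergence(expected, actual);
        if (idx < 0) {
            System.out.println("Extracted text matches the expected text.");
        } else {
            System.out.println("First divergence at index " + idx);
            System.out.println("  expected: " + expected.substring(idx));
            System.out.println("  actual:   " + actual.substring(idx));
        }
    }
}
```

On the two strings quoted in the report, the first mismatch falls at index 42, which is exactly where '5637' begins; everything before it is identical.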
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)