Tim Allison created PDFBOX-2247:
-----------------------------------
Summary: Regression in text extraction between 1.8.5 and 1.8.6
Key: PDFBOX-2247
URL: https://issues.apache.org/jira/browse/PDFBOX-2247
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.6
Reporter: Tim Allison
Priority: Minor
It looks like a character mapping issue crept in between 1.8.5 and 1.8.6 for this
[file|http://digitalcorpora.org/corp/nps/files/govdocs1/701/701542.pdf].
With both the sequential and NonSeq parsers, ExtractText in 1.8.5 extracted the
correct text. In 1.8.6, java -jar pdfbox-app-1.8.6.jar ExtractText
yields text starting with: {noformat}7>PFLK>I 9>NH ;BNRF@B
=%;% .BM>NPJBKP LC PEB 3KPBNFLN
9>@FCF@ -L>OP ;@FBK@B >KA 5B>NKFKD -BKPBN
:BOB>N@E 9NLGB@P ;QJJ>NT .B@BJ?BN (&&*
"&++&,-+Æ$( #&+-&%+$-& !).&)-*+Æ&,{noformat}
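The comparison described in this report could be scripted roughly as follows. This is only a sketch under several assumptions: both pdfbox-app jars and a local copy of the PDF (named 701542.pdf here) sit in the working directory, the ExtractText tool accepts the -console and -nonSeq options as its 1.8.x usage text suggests, and Java 9 or later is available. The class and method names are hypothetical and not part of PDFBox.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class ExtractTextRepro {

    // Builds the ExtractText command line for one pdfbox-app jar.
    // The -console and -nonSeq flags are assumed from the 1.8.x usage text.
    static List<String> command(String appJar, String pdf, boolean nonSeq) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(appJar);
        cmd.add("ExtractText");
        cmd.add("-console");
        if (nonSeq) {
            cmd.add("-nonSeq");
        }
        cmd.add(pdf);
        return cmd;
    }

    // Returns the index of the first differing character, or -1 if the
    // two strings are identical.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    // Runs a command and returns its standard output, decoded with the
    // platform default charset. Standard error is passed through.
    static String run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        byte[] out = p.getInputStream().readAllBytes(); // Java 9+
        p.waitFor();
        return new String(out, Charset.defaultCharset());
    }

    public static void main(String[] args) throws Exception {
        String pdf = "701542.pdf"; // assumed local copy of the linked file
        for (boolean nonSeq : new boolean[] {false, true}) {
            String before = run(command("pdfbox-app-1.8.5.jar", pdf, nonSeq));
            String after = run(command("pdfbox-app-1.8.6.jar", pdf, nonSeq));
            System.out.println((nonSeq ? "NonSeq" : "seq")
                    + " parser: first difference at index "
                    + firstDifference(before, after));
        }
    }
}
```

An index of -1 would mean the two versions produced identical text for that parser; any other value points at the first character where the outputs diverge.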