[ https://issues.apache.org/jira/browse/PDFBOX-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076401#comment-14076401 ]

Tilman Hausherr edited comment on PDFBOX-2247 at 7/28/14 6:42 PM:
------------------------------------------------------------------

The change happened between 1592410 and 1592629 in PDFBOX-2058.

was (Author: tilman):
The change happened between 1592410 and 1592629.

> Regression in text extraction between 1.8.5 and 1.8.6
> -----------------------------------------------------
>
>                 Key: PDFBOX-2247
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2247
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6
>            Reporter: Tim Allison
>            Priority: Minor
>
> Looks like a character mapping issue crept in some time between 1.8.5 and
> 1.8.6 on this
> [file|http://digitalcorpora.org/corp/nps/files/govdocs1/701/701542.pdf]?
> With both seq and NonSeq parsers, the correct text was extracted via
> ExtractText in 1.8.5. In 1.8.6, java -jar pdfbox-app-1.8.6.jar ExtractText
> yields text starting with: {noformat}7>PFLK>I 9>NH ;BNRF@B
> =%;% .BM>NPJBKP LC PEB 3KPBNFLN
> 9>@FCF@ -L>OP ;@FBK@B >KA 5B>NKFKD -BKPBN
> :BOB>N@E 9NLGB@P ;QJJ>NT .B@BJ?BN (&&*
> "&++&,-+Æ$( #&+-&%+$-& !).&)-*+Æ&,{noformat}

--
This message was sent by Atlassian JIRA
(v6.2#6252)