[ https://issues.apache.org/jira/browse/PDFBOX-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906122#comment-17906122 ]

ASF subversion and git services commented on PDFBOX-3774:
---------------------------------------------------------

Commit 1922536 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1922536 ]

PDFBOX-3774: conditionally ignore spaces from the content stream; add setting + getter/setter + test + code simplification by Kevin Day

> Incorrectly extracted text (broken words)
> -----------------------------------------
>
>                 Key: PDFBOX-3774
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3774
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.5
>         Environment: Darwin Ninos-MacBook-Pro.local 16.5.0 Darwin Kernel
> Version 16.5.0: Fri Mar 3 16:52:33 PST 2017;
> root:xnu-3789.51.2~3/RELEASE_X86_64 x86_64
>            Reporter: Nino Skopac
>            Priority: Major
>         Attachments: PDFBOX-3774_reduced.pdf
>
>
> First reported on Tika JIRA
> (https://issues.apache.org/jira/browse/TIKA-2342), but tracked down to PDFBox:
> ~ Usage
> java -jar pdfbox-app-2.0.5.jar ExtractText Huge_book.pdf Huge-book-pdfbox.txt
> ~ Sample
> Original PDF text: "Each certified or noncertified member"
> Tika extracted text: "Each certifi ed or noncertifi ed member"
> Thank you,
> Nino

--
This message was sent by Atlassian Jira
(v8.20.10#820010)