[ https://issues.apache.org/jira/browse/PDFBOX-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-3774.
-------------------------------------
    Resolution: Fixed

Tika related improvements will be done in TIKA-2342

> Incorrectly extracted text (broken words)
> -----------------------------------------
>
>                 Key: PDFBOX-3774
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3774
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.5, 2.0.32, 3.0.3 PDFBox
>         Environment: Darwin Ninos-MacBook-Pro.local 16.5.0 Darwin Kernel
> Version 16.5.0: Fri Mar 3 16:52:33 PST 2017;
> root:xnu-3789.51.2~3/RELEASE_X86_64 x86_64
>            Reporter: Nino Skopac
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.33, 3.0.4 PDFBox, 4.0.0
>
>         Attachments: PDFBOX-3774_reduced.pdf
>
>
> First reported on Tika JIRA
> (https://issues.apache.org/jira/browse/TIKA-2342), but tracked down to PDFBox:
> ~ Usage
> java -jar pdfbox-app-2.0.5.jar ExtractText Huge_book.pdf Huge-book-pdfbox.txt
> ~ Sample
> Original PDF text: "Each certified or noncertified member"
> Tika extracted text: "Each certifi ed or noncertifi ed member"
> Thank you,
> Nino
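
For anyone re-checking the sample from the report against a fixed build, the symptom can be confirmed with a small check like the one below. This is only a sketch: BrokenWordCheck and its methods are hypothetical names for illustration and are not part of PDFBox or Tika. It assumes the expected text is known in advance and that the only damage is whitespace inserted inside words, as in the sample above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of PDFBox or Tika. */
final class BrokenWordCheck {

    /** True when both strings are identical once all whitespace is removed. */
    static boolean differsOnlyByWhitespace(String expected, String extracted) {
        return stripWhitespace(expected).equals(stripWhitespace(extracted));
    }

    /** Returns the words of 'expected' that do not appear intact in 'extracted'. */
    static List<String> brokenWords(String expected, String extracted) {
        List<String> extractedWords = Arrays.asList(extracted.trim().split("\\s+"));
        List<String> broken = new ArrayList<>();
        for (String word : expected.trim().split("\\s+")) {
            if (!extractedWords.contains(word)) {
                broken.add(word);
            }
        }
        return broken;
    }

    private static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // The two strings quoted in the issue's "Sample" section.
        String expected = "Each certified or noncertified member";
        String extracted = "Each certifi ed or noncertifi ed member";

        System.out.println(differsOnlyByWhitespace(expected, extracted));
        System.out.println(brokenWords(expected, extracted));
    }
}
```

If differsOnlyByWhitespace returns true while brokenWords is non-empty, the extraction has the same kind of damage described in this issue: the characters are all present, but spaces were inserted inside words.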