[
https://issues.apache.org/jira/browse/PDFBOX-3774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992827#comment-15992827
]
Tim Allison commented on PDFBOX-3774:
-------------------------------------
And my favorite test: search Google for "noncertifi ed member", and you'll find
this book. If you search Google Books for "Each certified and noncertified member,"
you'll find an earlier edition. If you search Google generally for that
phrase, it looks like one aggregator was able to extract it correctly.
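
The same kind of check can be run locally on extracted text. What follows is
only a rough sketch, assuming the breakage always looks like the sample in the
issue description: a fragment ending in "fi", "fl" or "ff", then a single
space, then a lowercase fragment. The class SplitWordCheck and the method
findSuspects are hypothetical names made up for illustration; they are not part
of PDFBox or Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox or Tika.
class SplitWordCheck {

    // Assumption: a broken word is a run of letters ending in "fi", "fl" or "ff",
    // followed by exactly one space and a run of lowercase letters.
    private static final Pattern SPLIT =
            Pattern.compile("\\b(\\p{L}+(?:fi|fl|ff)) (\\p{Ll}+)\\b");

    // Returns the rejoined form of every suspicious pair found in the text.
    static List<String> findSuspects(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = SPLIT.matcher(text);
        while (m.find()) {
            hits.add(m.group(1) + m.group(2));
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample string taken from the issue description.
        String extracted = "Each certifi ed or noncertifi ed member";
        System.out.println(findSuspects(extracted)); // prints [certified, noncertified]
    }
}
```

This only produces candidates. Legitimate pairs such as "staff member" match
the same pattern, so each hit would still need a dictionary lookup or a manual
look before it is counted as a broken word.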
> Incorrectly extracted text (broken words)
> -----------------------------------------
>
> Key: PDFBOX-3774
> URL: https://issues.apache.org/jira/browse/PDFBOX-3774
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.5
> Environment: Darwin Ninos-MacBook-Pro.local 16.5.0 Darwin Kernel
> Version 16.5.0: Fri Mar 3 16:52:33 PST 2017;
> root:xnu-3789.51.2~3/RELEASE_X86_64 x86_64
> Reporter: Nino Skopac
> Attachments: Huge_book.pdf
>
>
> First reported on Tika JIRA
> (https://issues.apache.org/jira/browse/TIKA-2342), but tracked down to PDFBox:
> ~ Usage
> java -jar pdfbox-app-2.0.5.jar ExtractText Huge_book.pdf Huge-book-pdfbox.txt
> ~ Sample
> Original PDF text: "Each certified or noncertified member"
> Tika extracted text: "Each certifi ed or noncertifi ed member"
> Thank you,
> Nino
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)