[ https://issues.apache.org/jira/browse/PDFBOX-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson closed PDFBOX-1153.
-------------------------------
Resolution: Won't Fix
This won't work in practice: there are too many non-dictionary words out there.
> Use dictionary lookups to increase text extraction accuracy
> -----------------------------------------------------------
>
> Key: PDFBOX-1153
> URL: https://issues.apache.org/jira/browse/PDFBOX-1153
> Project: PDFBox
> Issue Type: New Feature
> Components: Text extraction
> Reporter: Jukka Zitting
>
> There are still some cases where the text extraction code incorrectly inserts
> spaces inside words extracted from a PDF document. We could increase
> extraction accuracy with an optional dictionary lookup mechanism that checks
> each extracted word or token against a dictionary of common words. If the
> lookup fails (and the amount of empty space after the token is small), the
> token is concatenated with the next one. If that concatenated token matches a
> word in the dictionary, the intervening space can very likely be dropped.
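
For illustration only, the lookup described in the issue could be sketched roughly as follows. This is not PDFBox code: the `Token` type, the `merge` method, and the `maxGap` threshold are hypothetical names made up for this sketch, and it assumes the extractor can report the width of the gap after each token. Tokens that never match the dictionary are left untouched, so words the dictionary does not know still come out split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DictionaryMerge {
    /** Hypothetical token: extracted text plus the width of the gap that follows it. */
    static final class Token {
        final String text;
        final float gapAfter;

        Token(String text, float gapAfter) {
            this.text = text;
            this.gapAfter = gapAfter;
        }
    }

    /**
     * Joins a token with its successor when the token is not in the dictionary,
     * the gap after it is at most maxGap (an assumed threshold), and the joined
     * text is in the dictionary. Otherwise the token is kept unchanged.
     */
    static List<String> merge(List<Token> tokens, Set<String> dictionary, float maxGap) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token current = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();
            if (hasNext
                    && !dictionary.contains(current.text.toLowerCase(Locale.ROOT))
                    && current.gapAfter <= maxGap) {
                String joined = current.text + tokens.get(i + 1).text;
                if (dictionary.contains(joined.toLowerCase(Locale.ROOT))) {
                    out.add(joined); // drop the intervening space
                    i += 2;
                    continue;
                }
            }
            out.add(current.text);
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("extraction", "of", "text"));
        List<Token> tokens = Arrays.asList(
                new Token("ex", 0.5f),       // small gap: candidate for merging
                new Token("traction", 3.0f),
                new Token("of", 3.0f),
                new Token("text", 0.0f));
        System.out.println(merge(tokens, dictionary, 1.0f)); // prints [extraction, of, text]
    }
}
```

The sketch only ever joins two adjacent tokens; a word split into three or more pieces would need repeated passes or a longer lookahead.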