[jira] [Closed] (PDFBOX-1153) Use dictionary lookups to increase text extraction accuracy

John Hewson (JIRA) Tue, 17 Jun 2014 13:20:27 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


John Hewson closed PDFBOX-1153.
-------------------------------

    Resolution: Won't Fix

This won't work in practice: there are too many non-dictionary words out there.

> Use dictionary lookups to increase text extraction accuracy
> -----------------------------------------------------------
>
>                 Key: PDFBOX-1153
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1153
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>            Reporter: Jukka Zitting
>
> There are still some cases where the text extraction code incorrectly inserts 
> spaces inside words extracted from a PDF document. We could increase 
> extraction accuracy with an optional dictionary lookup mechanism that checks 
> each extracted word or token against a dictionary of common words. If the 
> lookup fails (and the amount of empty space after the token is small), the 
> token is concatenated with the next one. If that concatenated token matches a 
> word in the dictionary, the intervening space can very likely be dropped.
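
The mechanism proposed in the description could be sketched roughly as follows. This is only an illustration of the idea, not PDFBox code: the class DictionaryJoiner, the method join, the gapsAfter array and the maxGap threshold are all hypothetical names, and the sketch assumes the extractor can report the width of the space after each token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper illustrating the proposal; not part of PDFBox. */
final class DictionaryJoiner {

    private DictionaryJoiner() {
    }

    /**
     * Joins adjacent tokens when the first fails the dictionary lookup,
     * the space after it is small, and the concatenation is a known word.
     *
     * @param tokens     extracted tokens in reading order
     * @param gapsAfter  gapsAfter[i] is the width of the space after tokens.get(i)
     * @param dictionary lower-case dictionary of common words
     * @param maxGap     gaps wider than this are always kept as word breaks
     */
    static List<String> join(List<String> tokens, double[] gapsAfter,
                             Set<String> dictionary, double maxGap) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String current = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();
            if (hasNext
                    && !isWord(current, dictionary)
                    && gapsAfter[i] <= maxGap
                    && isWord(current + tokens.get(i + 1), dictionary)) {
                // The joined token is a dictionary word: drop the space.
                out.add(current + tokens.get(i + 1));
                i += 2;
            } else {
                // Either a known word, a wide gap, or no match: keep as-is.
                out.add(current);
                i++;
            }
        }
        return out;
    }

    private static boolean isWord(String token, Set<String> dictionary) {
        return dictionary.contains(token.toLowerCase(Locale.ROOT));
    }
}
```

With a dictionary containing "extraction", "of" and "text", the tokens "ex", "traction", "of", "text" with a small gap after "ex" would come back as "extraction", "of", "text". A split such as "PDF" / "Box" stays untouched, because the joined token is not in the dictionary, so the sketch cannot repair words the dictionary does not know.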



--
This message was sent by Atlassian JIRA
(v6.2#6252)
