Use dictionary lookups to increase text extraction accuracy
-----------------------------------------------------------

                 Key: PDFBOX-1153
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1153
             Project: PDFBox
          Issue Type: New Feature
          Components: Text extraction
            Reporter: Jukka Zitting


The text extraction code still sometimes inserts spurious spaces inside words 
extracted from a PDF document. We could increase extraction accuracy with an 
optional dictionary lookup mechanism that checks each extracted word or token 
against a dictionary of common words. If the lookup fails and the gap after the 
token is small, the token is tentatively concatenated with the next one. If the 
concatenated token matches a word in the dictionary, the intervening space is 
very likely spurious and can be dropped.
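
The lookup described above could be sketched roughly as follows. This is only 
an illustration of the idea, not existing PDFBox code: the DictionaryMerge 
class, the Token type, its gapAfter field, and the maxGap threshold are all 
hypothetical names, and the sketch assumes the dictionary holds lower-case 
words and that tokens carry no attached punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the PDFBox API.
class DictionaryMerge {

    /** An extracted token plus the width of the gap that follows it (assumed type). */
    static final class Token {
        final String text;
        final float gapAfter;

        Token(String text, float gapAfter) {
            this.text = text;
            this.gapAfter = gapAfter;
        }
    }

    /**
     * Joins a token with its successor when the token itself is not a
     * dictionary word, the gap after it is at most maxGap, and the
     * concatenation is a dictionary word. Otherwise tokens pass through as-is.
     */
    static List<String> mergeSplitWords(List<Token> tokens, Set<String> dictionary, float maxGap) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Token current = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();
            if (hasNext && !contains(dictionary, current.text) && current.gapAfter <= maxGap) {
                String joined = current.text + tokens.get(i + 1).text;
                if (contains(dictionary, joined)) {
                    // The space between the two tokens was most likely spurious.
                    words.add(joined);
                    i += 2;
                    continue;
                }
            }
            words.add(current.text);
            i++;
        }
        return words;
    }

    // Case-insensitive lookup; assumes the dictionary stores lower-case words.
    private static boolean contains(Set<String> dictionary, String word) {
        return dictionary.contains(word.toLowerCase(Locale.ROOT));
    }
}
```

With a dictionary containing "extraction", the tokens "extrac" and "tion" 
separated by a small gap would be merged, while a token followed by a wide gap 
is left alone even if it is not a dictionary word. The sketch only ever joins 
two adjacent tokens; words split into three or more pieces would need a 
repeated or greedy variant.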


        
