Text extract fails on some PDF files but not others...
------------------------------------------------------

                 Key: PDFBOX-620
                 URL: https://issues.apache.org/jira/browse/PDFBOX-620
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.8.0-incubator, 0.7.3
         Environment: Tried in Java 5 and 6
            Reporter: Nicholas Cottrell


I am seeing the same problem with 0.7.3, 0.7.4-dev and 0.8.0. In 0.7.3 the
extracted text contains the literal string "null" in place of some
characters, e.g. "Dermoapo made 'interactive updates' a key part onullits
stratenull nullr launnull chinnulla new skincare rannull in a competitive
market. nulle resultnullIncreased sales nullr pharmacies that used the
updates."

In 0.8.0 the same passage has "?" in those positions: "Dermoapo made
'interactive updates' a key part o?its strate? ?r laun? chin?a new skincare
ran? in a competitive market. ?e result?Increased sales ?r pharmacies that
used the updates."

Maybe this is a font problem? Or an encoding problem? I debugged the code in
PDFTextStripper, and the bad characters already appear in the
charactersByArticle field before normalization.
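To illustrate how a literal "null" could end up in the output, here is a minimal, self-contained Java sketch. It does not use PDFBox and is not taken from its source. It assumes, as a guess only, that some character codes in the PDF have no entry in the font's encoding, so the lookup returns a null String that is then appended to the text. The class NullGlyphDemo, the ENCODING table and the codes 1, 2 and 3 are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class NullGlyphDemo {
    // Hypothetical code-to-text table standing in for a font encoding.
    // Code 3 is deliberately left unmapped (assumption for illustration).
    static final Map<Integer, String> ENCODING = new HashMap<>();
    static {
        ENCODING.put(1, "o");
        ENCODING.put(2, "its");
    }

    // Appends whatever the lookup returns. For an unmapped code the lookup
    // returns null, and StringBuilder.append(String) writes "null".
    static String decodeNaive(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String s = ENCODING.get(code);
            sb.append(s);
        }
        return sb.toString();
    }

    // Same loop, but an unmapped code is replaced by a "?" placeholder.
    static String decodeWithPlaceholder(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String s = ENCODING.get(code);
            sb.append(s != null ? s : "?");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 3, 2};
        System.out.println(decodeNaive(codes));           // prints "onullits"
        System.out.println(decodeWithPlaceholder(codes)); // prints "o?its"
    }
}
```

The two results have the same shape as the 0.7.3 and 0.8.0 output quoted above ("onullits" and "o?its"), which would be consistent with a missing character mapping rather than a problem in normalization. Whether that is the actual cause for this PDF is unconfirmed.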



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
