Text extract fails on some PDF files but not others...
------------------------------------------------------
Key: PDFBOX-620
URL: https://issues.apache.org/jira/browse/PDFBOX-620
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.8.0-incubator, 0.7.3
Environment: Tried in Java 5 and 6
Reporter: Nicholas Cottrell
I see the same problem with 0.7.3, 0.7.4-dev and 0.8.0. In 0.7.3 the extracted
text contains the literal string "null" in place of some characters, e.g.
"Dermoapo made 'interactive updates' a key part onullits stratenull nullr
launnull chinnulla new skincare rannull in a competitive market. nulle
resultnullIncreased sales nullr pharmacies that used the updates." In 0.8.0
the same passage appears with "?" in those positions: "Dermoapo made
'interactive updates' a key part o?its strate? ?r laun? chin?a new skincare
ran? in a competitive market. ?e result?Increased sales ?r pharmacies that
used the updates."
Maybe this is a font problem? Or an encoding problem? I debugged the code in
PDFTextStripper, and the bad characters are already present in the
charactersByArticle field before normalization.
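
For illustration only, here is a minimal plain-Java sketch of one way output
like the above could arise. It assumes, without having confirmed it in the
PDFBox source, that some glyphs have no Unicode mapping and reach the text
stage as a null String. The class NullGlyphDemo and its methods are
hypothetical and use no PDFBox API; they only show that appending a null
String yields the literal text "null", and that substituting a placeholder
yields "?" instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of PDFBox.
public class NullGlyphDemo {

    // Appends each per-glyph string as-is. StringBuilder.append(String)
    // appends the four characters "null" when the argument is null.
    static String joinNaive(List<String> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (String g : glyphs) {
            sb.append(g);
        }
        return sb.toString();
    }

    // Same loop, but an unmapped (null) glyph is replaced by a placeholder.
    static String joinWithPlaceholder(List<String> glyphs, char placeholder) {
        StringBuilder sb = new StringBuilder();
        for (String g : glyphs) {
            if (g == null) {
                sb.append(placeholder);
            } else {
                sb.append(g);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed glyph sequence: null marks a glyph with no Unicode mapping.
        List<String> glyphs = Arrays.asList("stra", "te", null, " ", null, "r");
        System.out.println(joinNaive(glyphs));                // stratenull nullr
        System.out.println(joinWithPlaceholder(glyphs, '?')); // strate? ?r
    }
}
```

If this assumption holds, the difference between 0.7.3 and 0.8.0 would only
be in how the missing mapping is rendered, not in whether it is missing.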