Incorrect output when word spacing is achieved by matrix translation
--------------------------------------------------------------------
Key: PDFBOX-881
URL: https://issues.apache.org/jira/browse/PDFBOX-881
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.3.1, 1.4.0
Reporter: David Rodríguez Alfayate
When extracting text from a PDF document in which word spacing is achieved by
matrix translation, versions 1.3.x and 1.4 merge adjacent words into a single
word.
This does not happen in the 1.2 branch. After some investigation, the error
turns out to have been introduced by a refactoring of the PDFStreamEngine
class, and is related to the textMatrixEnd computation. In the 1.2 branch,
characterSpacingWidth was added after computing textMatrixEnd, but in 1.3 (and
1.4) characterSpacingWidth is already included in textMatrixEnd when it is
computed, so the system is unable to detect a new word.
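
To illustrate the effect, here is a minimal, self-contained Java sketch. It is
not PDFBox code: the class WordGapSketch, the extract method, the
one-dimensional glyph model and the fixed gap threshold are all assumptions
made for this example. It only shows how including the character spacing in
the end position of a glyph can shrink the measured gap to the next glyph
enough that a word break is no longer detected.

```java
// Hypothetical one-dimensional model of word-break detection.
// None of these names come from PDFBox; they exist only for this sketch.
class WordGapSketch {

    /**
     * Rebuilds a string from glyphs laid out on a single line.
     *
     * @param chars         the glyphs, in drawing order
     * @param starts        horizontal start position of each glyph
     * @param widths        advance width of each glyph (without spacing)
     * @param charSpacing   character spacing applied after every glyph
     * @param preAddSpacing true  = spacing is included in the glyph end
     *                              (the behaviour described for 1.3/1.4),
     *                      false = glyph end excludes spacing
     *                              (the behaviour described for 1.2)
     * @param gapThreshold  assumed minimum gap that counts as a word break
     */
    static String extract(char[] chars, double[] starts, double[] widths,
                          double charSpacing, boolean preAddSpacing,
                          double gapThreshold) {
        StringBuilder out = new StringBuilder();
        double previousEnd = 0;
        for (int i = 0; i < chars.length; i++) {
            // A gap larger than the threshold is treated as a word break.
            if (i > 0 && starts[i] - previousEnd > gapThreshold) {
                out.append(' ');
            }
            out.append(chars[i]);
            double end = starts[i] + widths[i];
            previousEnd = preAddSpacing ? end + charSpacing : end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "ab" and "cd" are separated by an extra translation of 3 units.
        // Every glyph is 5 wide and followed by 1 unit of character spacing.
        char[] chars = {'a', 'b', 'c', 'd'};
        double[] starts = {0, 6, 15, 21};
        double[] widths = {5, 5, 5, 5};

        // Spacing excluded from the glyph end: the gap before 'c' is 4.
        System.out.println(extract(chars, starts, widths, 1, false, 3.5)); // ab cd
        // Spacing included in the glyph end: the gap before 'c' is only 3.
        System.out.println(extract(chars, starts, widths, 1, true, 3.5));  // abcd
    }
}
```

In this toy model the only difference between the two calls is whether the
character spacing is counted as part of the previous glyph, which matches the
change described above; the actual threshold logic in PDFBox may differ.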