[
https://issues.apache.org/jira/browse/PDFBOX-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Rodríguez Alfayate updated PDFBOX-881:
--------------------------------------------
Attachment: output_1_3.txt
output_1_2.txt
alta_padron.pdf
Sample PDF document, plus the extraction output from versions 1.2 and 1.3.
> Incorrect output when word spacing is achieved by matrix translation
> --------------------------------------------------------------------
>
> Key: PDFBOX-881
> URL: https://issues.apache.org/jira/browse/PDFBOX-881
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.3.1, 1.4.0
> Reporter: David Rodríguez Alfayate
> Attachments: alta_padron.pdf, output_1_2.txt, output_1_3.txt,
> pdfbox-characterspacing.patch
>
>
> When extracting text from a PDF document in which word spacing is achieved by
> matrix translation, versions 1.3.x and 1.4 merge separate words together.
> This does not happen in the 1.2 branch. After some investigation, the
> error was introduced by a refactoring of the PDFStreamEngine class and is
> related to the textMatrixEnd computation. In the 1.2 branch,
> characterSpacingWidth was added after textMatrixEnd had been computed, but in
> 1.3 (and 1.4) characterSpacingWidth is already included in textMatrixEnd, so
> the system is unable to detect the start of a new word.
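
The effect described above can be illustrated with a minimal one-dimensional sketch. This is not PDFBox code: the class WordGapSketch, the Glyph type, the gap threshold, and all numbers are hypothetical and chosen only to show why including the character spacing in a glyph's end position can hide a word gap that was produced by a translation.

```java
import java.util.Arrays;
import java.util.List;

public class WordGapSketch {
    // Hypothetical glyph in a 1-D model: where it starts and how wide it is.
    public static final class Glyph {
        final char c;
        final double start;
        final double width;

        Glyph(char c, double start, double width) {
            this.c = c;
            this.start = start;
            this.width = width;
        }
    }

    // Emits a space whenever the distance between the previous glyph's end
    // and the next glyph's start exceeds gapThreshold.
    // spacingInsideEnd = false: end = start + width (spacing applied afterwards).
    // spacingInsideEnd = true:  end = start + width + charSpacing (spacing pre-added).
    public static String extract(List<Glyph> glyphs, double charSpacing,
                                 boolean spacingInsideEnd, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        double previousEnd = Double.NaN;
        for (Glyph g : glyphs) {
            if (!Double.isNaN(previousEnd) && g.start - previousEnd > gapThreshold) {
                out.append(' ');
            }
            out.append(g.c);
            previousEnd = g.start + g.width + (spacingInsideEnd ? charSpacing : 0.0);
        }
        return out.toString();
    }

    // Two words, "ab" and "cd". Glyph width 5, character spacing 2.
    // 'c' is positioned by a translation that leaves 4 units after the ink of 'b'.
    public static List<Glyph> sample() {
        return Arrays.asList(
            new Glyph('a', 0, 5),
            new Glyph('b', 7, 5),
            new Glyph('c', 16, 5),
            new Glyph('d', 23, 5));
    }

    public static void main(String[] args) {
        System.out.println(extract(sample(), 2.0, false, 2.5));
        System.out.println(extract(sample(), 2.0, true, 2.5));
    }
}
```

In this toy model the gap before 'c' measures 4 units when the spacing is left out of the end position, which is above the assumed threshold of 2.5, and only 2 units when the spacing is pre-added, which is below it. The second variant therefore joins the two words.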