[ 
https://issues.apache.org/jira/browse/PDFBOX-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Rodríguez Alfayate updated PDFBOX-881:
--------------------------------------------

    Attachment: pdfbox-characterspacing.patch

Patch for the described issue

> Incorrect output when word spacing is achieved by matrix translation
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-881
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-881
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1, 1.4.0
>            Reporter: David Rodríguez Alfayate
>         Attachments: alta_padron.pdf, output_1_2.txt, output_1_3.txt, 
> pdfbox-characterspacing.patch
>
>
> When extracting text from a PDF document in which word spacing is achieved
> by matrix translation, versions 1.3.x and 1.4 merge adjacent words into one.
> This does not happen in the 1.2 branch. After some investigation, the error
> was introduced by a refactoring of the PDFStreamEngine class and is related
> to the textMatrixEnd computation. In the 1.2 branch, characterSpacingWidth
> was added after textMatrixEnd had been computed; in 1.3 (and 1.4),
> characterSpacingWidth is added to textMatrixEnd beforehand, so the extractor
> is unable to detect the start of a new word.
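
The effect described in the report can be illustrated with a small toy model. This is only a sketch and not PDFBox code: the class WordGapSketch, its Glyph type, the gap threshold, and all coordinates are hypothetical and were picked for illustration. It assumes the extractor inserts a space whenever the distance between the end of one glyph and the start of the next exceeds a threshold, and it compares an end position that excludes character spacing (as the report describes for 1.2) with one that already includes it (as described for 1.3 and 1.4).

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these names exist in PDFBox.
final class WordGapSketch {

    /** A positioned glyph: the character, its start x, and its advance width. */
    static final class Glyph {
        final char ch;
        final double startX;
        final double width;

        Glyph(char ch, double startX, double width) {
            this.ch = ch;
            this.startX = startX;
            this.width = width;
        }
    }

    /**
     * Rebuilds text from glyph positions. A space is emitted when the gap
     * between the previous glyph's end and the next glyph's start exceeds
     * gapThreshold. If preAddSpacing is true, the character spacing is
     * folded into the glyph's end position, which shrinks every gap.
     */
    static String extract(List<Glyph> glyphs, double charSpacing,
                          boolean preAddSpacing, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        double prevEnd = Double.NaN; // no previous glyph yet
        for (Glyph g : glyphs) {
            if (!Double.isNaN(prevEnd) && g.startX - prevEnd > gapThreshold) {
                out.append(' ');
            }
            out.append(g.ch);
            prevEnd = g.startX + g.width + (preAddSpacing ? charSpacing : 0.0);
        }
        return out.toString();
    }

    /** Two words, "ab" and "cd"; the second is positioned by a translation. */
    static List<Glyph> sampleLine() {
        return Arrays.asList(
                new Glyph('a', 0.0, 5.0),
                new Glyph('b', 6.0, 5.0),   // 5.0 width + 1.0 character spacing
                new Glyph('c', 13.5, 5.0),  // moved here by a matrix translation
                new Glyph('d', 19.5, 5.0));
    }

    public static void main(String[] args) {
        double charSpacing = 1.0;   // made-up value
        double gapThreshold = 2.0;  // made-up value
        System.out.println(extract(sampleLine(), charSpacing, false, gapThreshold));
        System.out.println(extract(sampleLine(), charSpacing, true, gapThreshold));
    }
}
```

With these made-up numbers, the gap before the second word is 2.5 when the end position excludes the character spacing, so a space is emitted; once the spacing is included in the end position the gap drops to 1.5, falls below the threshold, and the two words run together.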

