[ 
https://issues.apache.org/jira/browse/PDFBOX-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-881.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.4.0
         Assignee: Andreas Lehmkühler

According to the PDF specification, and from a rendering point of view, the 
calculations are correct, but David is right about text extraction. The end 
of each text chunk should be calculated without including any spacing 
values, so that the text extraction algorithm is able to detect new words.
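
To illustrate why this matters, here is a minimal sketch in a simplified 
one-dimensional model. It is not the PDFBox code: the class WordGapSketch, 
its methods and the gap threshold are hypothetical, and the real engine works 
on text matrices rather than plain x coordinates. It only shows that if the 
spacing is folded into the reported end of a chunk, the gap to the next chunk 
disappears and no word break can be detected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of chunk positions on a single text line.
final class WordGapSketch {

    // One text chunk with the start and end positions reported to the extractor.
    static final class Chunk {
        final String text;
        final double startX;
        final double endX;

        Chunk(String text, double startX, double endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    // Places chunks left to right. spacingAfter[i] is the extra spacing applied
    // after chunk i. If includeSpacingInEnd is true, that spacing is folded into
    // the reported end position (the behaviour described in the issue).
    static List<Chunk> layout(String[] texts, double width,
                              double[] spacingAfter, boolean includeSpacingInEnd) {
        List<Chunk> chunks = new ArrayList<>();
        double x = 0;
        for (int i = 0; i < texts.length; i++) {
            double advance = width + spacingAfter[i];
            double reportedEnd = includeSpacingInEnd ? x + advance : x + width;
            chunks.add(new Chunk(texts[i], x, reportedEnd));
            x += advance; // the next chunk starts after the spacing in both cases
        }
        return chunks;
    }

    // Inserts a space whenever the gap between the previous chunk's reported end
    // and the next chunk's start exceeds the threshold.
    static String extract(List<Chunk> chunks, double gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null && c.startX - prev.endX > gapThreshold) {
                sb.append(' ');
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] texts = {"a", "b", "c", "d"};
        double[] spacingAfter = {0, 4, 0, 0}; // word gap after "b"
        System.out.println(extract(layout(texts, 5, spacingAfter, true), 2));  // abcd
        System.out.println(extract(layout(texts, 5, spacingAfter, false), 2)); // ab cd
    }
}
```

In both runs the chunks are placed at the same positions, so rendering would 
be identical. Only the reported end position differs, and that is the value 
the word detection relies on.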

Thanks for the contribution. I applied the proposed patch in revision 1031580, 
including some minor tweaks and optimizations, e.g. I removed all the matrix 
copies which aren't really needed.


> Incorrect output when word spacing is achieved by matrix translation
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-881
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-881
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1, 1.4.0
>            Reporter: David Rodríguez Alfayate
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.4.0
>
>         Attachments: alta_padron.pdf, output_1_2.txt, output_1_3.txt, 
> pdfbox-characterspacing.patch
>
>
> When extracting text from a PDF document in which word spacing is achieved 
> by matrix translation, versions 1.3.x and 1.4 merge the separate words.
> This does not happen in the 1.2 branch. After investigating a bit, I found 
> that the error was introduced by a refactoring of the PDFStreamEngine class 
> and is related to the textMatrixEnd computation. In the 1.2 branch the 
> characterSpacingWidth was added after computing the textMatrixEnd, but in 
> 1.3 (and 1.4) the characterSpacingWidth is already included in the 
> textMatrixEnd, so the system is unable to detect a new word.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
