[ 
https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Phillips updated PDFBOX-1805:
----------------------------------

    Attachment: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf

> PDFTextStripper, add word segment even if the last word is a space
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-1805
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1805
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>            Reporter: Andy Phillips
>         Attachments: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf
>
>
> I found that, in some PDFs, not injecting a WordSpacing during the "line"
> normalization when a gap in the line is greater than expected for a space
> causes text "fields" that should be separated (as they are not really part
> of the paragraph) to be improperly added to the line of text.
> In the attached PDF, on the first line of the first code violation, the
> "Corrected By" date is incorrectly added to the same line as "Description
> of Violation". This happens because the first line of "Description of
> Violation" ends with a space, which comes from word wrapping of the
> paragraph when it was generated. I believe that if the gap before the next
> letter in the line is greater than an expected space, it should be
> considered a second segment, regardless of whether the preceding text ends
> in a space.
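
The rule proposed in the description can be sketched roughly as follows. This
is only an illustration under stated assumptions, not the PDFTextStripper
code: the Glyph class, the splitSegments method and the expectedSpaceWidth
parameter are hypothetical names invented for this example, and no PDFBox API
is used.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentSketch {
    /** Hypothetical stand-in for a positioned glyph; not a PDFBox type. */
    static final class Glyph {
        final String text;
        final float x;      // left edge of the glyph
        final float width;  // advance width of the glyph

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float endX() {
            return x + width;
        }
    }

    /**
     * Splits one line of glyphs (assumed sorted left to right) into segments.
     * A new segment starts whenever the horizontal gap to the previous glyph
     * exceeds expectedSpaceWidth, even if the previous glyph is a space.
     */
    static List<String> splitSegments(List<Glyph> line, float expectedSpaceWidth) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            // The check looks only at the gap, never at what the previous glyph was.
            if (previous != null && g.x - previous.endX() > expectedSpaceWidth) {
                segments.add(current.toString());
                current = new StringBuilder();
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

The point of the sketch is that the gap test does not depend on whether the
preceding glyph is a space, so a trailing space left by word wrapping would
not cause a far-away field to be merged into the paragraph line.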



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
