[ https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Phillips updated PDFBOX-1805:
----------------------------------
Attachment: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf
> PDFTextStripper, add word segment even if the last word is a space
> ------------------------------------------------------------------
>
> Key: PDFBOX-1805
> URL: https://issues.apache.org/jira/browse/PDFBOX-1805
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.3
> Reporter: Andy Phillips
> Attachments: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf
>
>
> I found that, in some PDFs, the "line" normalization does not inject a
> WordSpacing when a gap within a line is wider than expected for a space.
> As a result, text "fields" that should be kept separate (they are not
> really part of the paragraph) are improperly added to the line of text.
> In the attached PDF, on the first line of the first code violation, the
> "Corrected By" date is incorrectly added to the same line as "Description
> of Violation". This happens because the first line of "Description of
> Violation" ends with a space, which comes from word wrapping of the
> paragraph when it was generated. I believe that if the gap before the next
> letter on the line is wider than an expected space, then regardless of
> whether the preceding text ends in a space, it should be considered a
> second segment.
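
The proposed rule can be sketched roughly as follows. This is only a minimal illustration and not PDFBox code: the `GapSegmenter` and `Glyph` names, the `segment` method, and the simple "gap wider than the expected space width" threshold are all assumptions made for the example. It shows a wide horizontal gap starting a new segment even when the text before the gap already ends in a space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
final class GapSegmenter {

    // Hypothetical stand-in for a positioned piece of text on one line.
    static final class Glyph {
        final String text;
        final float x;      // left edge of the text
        final float width;  // horizontal extent of the text

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Splits one line into segments wherever the horizontal gap between
    // consecutive glyphs is wider than the expected space width. The check
    // looks only at the gap, so a trailing space in the current segment
    // does not suppress the split.
    static List<String> segment(List<Glyph> line, float expectedSpaceWidth) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float previousEnd = Float.NaN; // no previous glyph yet

        for (Glyph g : line) {
            boolean wideGap = !Float.isNaN(previousEnd)
                    && (g.x - previousEnd) > expectedSpaceWidth;
            if (wideGap && current.length() > 0) {
                segments.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previousEnd = g.x + g.width;
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

With made-up positions such as "Violation" at x=0 (width 50), a space at x=50 (width 3), and a date field starting at x=120, an expected space width of 3 would give two segments, the first of which keeps its trailing space.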