Andy Phillips created PDFBOX-1805:
-------------------------------------
Summary: PDFTextStripper, add word segment even if the last word
is a space
Key: PDFBOX-1805
URL: https://issues.apache.org/jira/browse/PDFBOX-1805
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.3
Reporter: Andy Phillips
I found that, in some PDFs, the "line" normalization does not inject a
WordSpacing into a line where the gap is greater than expected for a space.
This causes text "fields" that should be separated (as they are not really
part of the paragraph) to be improperly added to the line of text.
In the attached PDF, looking at the first line of the first code violation, I
found that the "Corrected By" date is incorrectly added to the same line as
"Description of Violation". This happens because the first line of
"Description of Violation" ends with a space, which is a result of word
wrapping when the paragraph was generated. I believe that if the next letter
in the line starts further away than an expected space, it should be
considered a second segment, regardless of whether the preceding text ends in
a space.
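To make the proposed behaviour concrete, here is a minimal, self-contained
sketch of the gap check. It does not use the PDFBox API: the class
SegmentSketch, the Chunk type, the split method and the ignoreTrailingSpace
flag are all hypothetical names made up for illustration, and the coordinates
in the usage note are invented. It only assumes that each piece of text has a
start position and a width, and that an expected space width is known.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentSketch {
    /** Hypothetical stand-in for a positioned piece of text (not PDFBox's TextPosition). */
    static final class Chunk {
        final String text;
        final float x;      // start of the chunk on the line
        final float width;  // rendered width of the chunk

        Chunk(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Splits one line into segments wherever the gap before a chunk is wider
     * than an expected space. When ignoreTrailingSpace is false, a wide gap is
     * skipped if the previous chunk already ends with a space (the behaviour
     * described in this report). When it is true, the gap alone decides (the
     * behaviour being requested).
     */
    static List<String> split(List<Chunk> line, float spaceWidth, boolean ignoreTrailingSpace) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : line) {
            if (last != null) {
                float expectedStart = last.x + last.width + spaceWidth;
                boolean gapTooWide = chunk.x > expectedStart;
                boolean lastEndsWithSpace = last.text.endsWith(" ");
                if (gapTooWide && (ignoreTrailingSpace || !lastEndsWithSpace)) {
                    // Close the current segment and start a new one.
                    segments.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(chunk.text);
            last = chunk;
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

With a wrapped chunk "wrapped text " at x=10 (width 60) followed by a chunk
"DATE" at x=200 and an expected space width of 3, passing false for
ignoreTrailingSpace keeps both chunks in one segment, while passing true
yields two segments.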
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)