Andy Phillips created PDFBOX-1805:
-------------------------------------

             Summary: PDFTextStripper, add word segment even if the last word 
is a space
                 Key: PDFBOX-1805
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1805
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.3
            Reporter: Andy Phillips


I found that, in some PDFs, the "line" normalization does not inject a 
WordSpacing when a gap within a line is wider than expected for a space. As a 
result, text "fields" that should be separated (as they are not really part of 
the paragraph) are improperly added to the line of text.

In the attached PDF, looking at the first line of the first violation of code, 
I found that the "Corrected By" date is incorrectly added to the same line as 
"Description of Violation". This happens because the first line of 
"Description of Violation" ends with a space, which comes from word wrapping 
of the paragraph when the PDF was generated. I believe that if the gap before 
the next letter in the line is greater than an expected space, it should be 
considered a second segment, regardless of whether the preceding text ends in 
a space.
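
To illustrate the behavior I am proposing, here is a minimal sketch. It is not 
PDFBox code: the Glyph class, the splitSegments method and the 
expectedSpaceWidth parameter are hypothetical names made up for this example, 
and the real PDFTextStripper logic is more involved. The only point is that the 
split decision looks at the horizontal gap alone and does not care whether the 
previous character is a space.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentSplitSketch {

    // Hypothetical stand-in for a positioned character; not PDFBox's TextPosition.
    static final class Glyph {
        final String text;
        final float x;      // left edge of the character
        final float width;  // advance width of the character

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float endX() {
            return x + width;
        }
    }

    // Splits one line of glyphs (assumed sorted left to right) into segments.
    // A new segment starts whenever the gap to the previous glyph exceeds
    // expectedSpaceWidth, even if the previous glyph is itself a space.
    static List<String> splitSegments(List<Glyph> line, float expectedSpaceWidth) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;

        for (Glyph glyph : line) {
            if (previous != null) {
                float gap = glyph.x - previous.endX();
                // Deliberately no check on previous.text here.
                if (gap > expectedSpaceWidth) {
                    segments.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(glyph.text);
            previous = glyph;
        }

        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }
}
```

With this approach, a wrapped paragraph line that ends in a trailing space is 
still kept apart from a date field placed far to its right, because the wide 
gap after the space triggers the split.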






--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
