[ https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-1805: ------------------------------------ Attachment: PDFBOX-1805.txt > PDFTextStripper, add word segment even if the last word is a space > ------------------------------------------------------------------ > > Key: PDFBOX-1805 > URL: https://issues.apache.org/jira/browse/PDFBOX-1805 > Project: PDFBox > Issue Type: Bug > Components: Text extraction > Affects Versions: 1.8.3 > Reporter: Andy Phillips > Priority: Major > Attachments: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf, PDFBOX-1805.txt > > > I found that, in some PDFs, not injecting a WordSpacing in a line that is > greater than expected for a space in the "line" normalization, causes text > "fields" that should be separated (as they are not really part of the > paragraph) to be improperly added to the line of text. > In the attached pdf, i have found that looking at the first line of the first > violation of code, that the "Corrected By" date is incorrectly added to the > same line of Description of Violation. This is due to the fact that the > first line of "Description of Violation" ends with a space. This is due to > word wrapping of the paragraph when it was generated and i believe that if > the next letter in the line is greater than an expected space, regardless if > the last line ends in a space, it should be considered a second segment. > I suggest removing the following change in PDFTextStripper file (i commented > out the last two requirements from the if statement): > {code} > //Test if our TextPosition starts after a new word would > be expected to start. > if (expectedStartOfNextWordX != > EXPECTEDSTARTOFNEXTWORDX_RESET_VALUE > && expectedStartOfNextWordX < positionX) /* && > //only bother adding a space if the last > character was not a space > lastPosition.getTextPosition().getCharacter() != > null && > > !lastPosition.getTextPosition().getCharacter().endsWith( " " ) ) */ > { > line.add(WordSeparator.getSeparator()); > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org