[ 
https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-1805:
------------------------------------
    Attachment: PDFBOX-1805.txt

> PDFTextStripper, add word segment even if the last word is a space
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-1805
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1805
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>            Reporter: Andy Phillips
>            Priority: Major
>         Attachments: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf, PDFBOX-1805.txt
>
>
> I found that, in some PDFs, the "line" normalization does not inject a 
> WordSeparator when the gap before the next piece of text is wider than an 
> expected space. As a result, text "fields" that should be separated (as they 
> are not really part of the paragraph) are improperly appended to the line of 
> text.
> In the attached PDF, in the first line of the first code violation, the 
> "Corrected By" date is incorrectly added to the same line as "Description of 
> Violation". This happens because the first line of "Description of Violation" 
> ends with a space, which comes from the word wrapping applied when the 
> paragraph was generated. I believe that if the next character starts further 
> away than an expected space, it should be considered a second segment, 
> regardless of whether the preceding text ends in a space.
> I suggest the following change in PDFTextStripper (I commented out the last 
> two conditions of the if statement):
> {code}
> // Test if our TextPosition starts after a new word would be expected to start.
> if (expectedStartOfNextWordX != EXPECTEDSTARTOFNEXTWORDX_RESET_VALUE
>         && expectedStartOfNextWordX < positionX) /* &&
>         // only bother adding a space if the last character was not a space
>         lastPosition.getTextPosition().getCharacter() != null &&
>         !lastPosition.getTextPosition().getCharacter().endsWith( " " ) ) */
> {
>     line.add(WordSeparator.getSeparator());
> }
> {code}
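
The effect of the two conditions discussed in the description can be reproduced outside PDFBox with a small stand-alone sketch. Everything in it is an assumption made for illustration: the Glyph class is a hypothetical stand-in for TextPosition, the "|" character stands in for WordSeparator.getSeparator(), the coordinates are invented, and the way expectedStartOfNextWordX is computed is simplified and may not match the real PDFTextStripper.

```java
import java.util.Arrays;
import java.util.List;

public class WordSeparatorSketch {
    // Hypothetical stand-in for PDFBox's TextPosition: text, start X and width.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Marker meaning "no expected start recorded yet" (illustrative value).
    static final float RESET_VALUE = -Float.MAX_VALUE;

    // Joins glyphs into one line, writing '|' where a separator would be added.
    // requireNonSpaceBefore = true mimics the original condition (no separator
    // when the previous text ends with a space); false mimics the proposal.
    static String joinLine(List<Glyph> glyphs, float spaceWidth,
                           boolean requireNonSpaceBefore) {
        StringBuilder line = new StringBuilder();
        float expectedStartOfNextWordX = RESET_VALUE;
        Glyph last = null;
        for (Glyph g : glyphs) {
            boolean gap = expectedStartOfNextWordX != RESET_VALUE
                    && expectedStartOfNextWordX < g.x;
            boolean lastIsNotSpace = last != null && !last.text.endsWith(" ");
            if (gap && (!requireNonSpaceBefore || lastIsNotSpace)) {
                line.append('|');
            }
            line.append(g.text);
            // Simplified assumption: next word is expected one space width
            // after the end of the current glyph.
            expectedStartOfNextWordX = g.x + g.width + spaceWidth;
            last = g;
        }
        return line.toString();
    }

    // Invented sample: a word, a trailing space from word wrapping, then a
    // separate field placed far to the right.
    static List<Glyph> sample() {
        return Arrays.asList(
                new Glyph("wall", 0f, 20f),
                new Glyph(" ", 20f, 5f),
                new Glyph("DATE", 100f, 25f));
    }

    public static void main(String[] args) {
        System.out.println(joinLine(sample(), 5f, true));   // wall DATE
        System.out.println(joinLine(sample(), 5f, false));  // wall |DATE
    }
}
```

With the trailing-space check in place the far-away field is merged into the line; without it, the large gap alone is enough to start a new segment, which is the behaviour the report asks for.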



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
