[ https://issues.apache.org/jira/browse/PDFBOX-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-1805:
------------------------------------
    Description: 
I found that, in some PDFs, the "line" normalization does not inject a word 
separator when the gap before the next piece of text is wider than expected 
for a space. As a result, text "fields" that should be separate (they are not 
really part of the paragraph) are improperly appended to the line of text.

In the attached PDF, looking at the first line of the first code violation, 
the "Corrected By" date is incorrectly added to the same line as "Description 
of Violation". This happens because the first line of "Description of 
Violation" ends with a space, which in turn comes from word wrapping when the 
paragraph was generated. I believe that if the next letter in the line starts 
farther away than an expected space, it should be considered a second segment, 
regardless of whether the text before it ends in a space.

I suggest removing the last two conditions from the following if statement in 
the PDFTextStripper file (I have commented them out here):
{code}
// Test if our TextPosition starts after a new word would be expected to start.
if (expectedStartOfNextWordX != EXPECTEDSTARTOFNEXTWORDX_RESET_VALUE
        && expectedStartOfNextWordX < positionX) /* &&
        // only bother adding a space if the last character was not a space
        lastPosition.getTextPosition().getCharacter() != null &&
        !lastPosition.getTextPosition().getCharacter().endsWith( " " ) ) */
{
    line.add(WordSeparator.getSeparator());
}
{code}
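
To make the effect of the two extra conditions easier to see, here is a small standalone sketch. It is not PDFBox code: the Chunk class, the joinLine method, the gapTolerance parameter, the "|" separator and the coordinates are all made up for illustration, and only the shape of the condition follows the snippet above.

```java
import java.util.Arrays;
import java.util.List;

class WordSeparatorSketch {
    /** Hypothetical stand-in for a positioned piece of text (not PDFBox's TextPosition). */
    static final class Chunk {
        final String text;
        final float x;      // left edge of the chunk
        final float width;  // horizontal extent of the chunk

        Chunk(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Marker meaning "no expected start position yet" (plays the role of the reset value).
    static final float RESET = -Float.MAX_VALUE;

    /**
     * Joins chunks into one line, inserting the separator when a chunk starts
     * later than the next word was expected to start.
     * If skipAfterTrailingSpace is true, the separator is suppressed whenever the
     * previous chunk ends with a space (the behaviour questioned in this issue).
     */
    static String joinLine(List<Chunk> chunks, float gapTolerance,
                           String separator, boolean skipAfterTrailingSpace) {
        StringBuilder line = new StringBuilder();
        float expectedStartOfNextWordX = RESET;
        Chunk last = null;
        for (Chunk c : chunks) {
            boolean startsLate = expectedStartOfNextWordX != RESET
                    && expectedStartOfNextWordX < c.x;
            boolean lastEndsWithSpace = last != null && last.text.endsWith(" ");
            if (startsLate && !(skipAfterTrailingSpace && lastEndsWithSpace)) {
                line.append(separator);
            }
            line.append(c.text);
            // A new word is expected to start a little after this chunk ends.
            expectedStartOfNextWordX = c.x + c.width + gapTolerance;
            last = c;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // A wrapped paragraph line ending in a space, then a far-away field.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("paragraph text ", 10f, 150f),
                new Chunk("DATE", 400f, 50f));
        System.out.println(joinLine(chunks, 3f, "|", true));   // paragraph text DATE
        System.out.println(joinLine(chunks, 3f, "|", false));  // paragraph text |DATE
    }
}
```

With the trailing-space check enabled, the far-away field is glued onto the paragraph line; with the check removed, the separator marks it as a second segment, which is the behaviour requested here.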

> PDFTextStripper, add word segment even if the last word is a space
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-1805
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1805
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>            Reporter: Andy Phillips
>         Attachments: 36048C3D-54B9-4862-91AA-C94B33C11027.pdf
