Emilian Bold created PDFBOX-4313:
------------------------------------

             Summary: PDFTextStripper groups unrelated chunks into words
                 Key: PDFBOX-4313
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.11
            Reporter: Emilian Bold


I have the text "10" and "11" and they get merged into the single word "1110".

Coordinates are:

1 575.36 x 227.4 w 4.447998 h 5.736
1 579.752 x 227.4 w 4.447998 h 5.736
1 526.2 x 227.4 w 4.447998 h 5.736
0 530.59204 x 227.4 w 4.447998 h 5.736

The bug is in this chunk:

{{{
                    // test if our TextPosition starts after a new word would be expected to start
                    if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
                            && expectedStartOfNextWordX < positionX &&
                            // only bother adding a space if the last character was not a space
                            lastPosition.getTextPosition().getUnicode() != null
                            && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
                    {
                        line.add(LineItem.getWordSeparator());
                    }
}}}

which seems to add a word separator only if the next char is "after" the 
current word. It never expects that the next char might be "before" the current 
word.
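
To make the effect concrete, here is a minimal standalone sketch that replays
the four coordinates above through a forward-only check like the one quoted.
It does not use PDFBox: the Glyph class, the GAP_TOLERANCE value and the
alsoBreakOnBackwardJump flag are all hypothetical names invented for this
illustration, and the backward test is only one possible idea, not a proposed
patch.

```java
import java.util.ArrayList;
import java.util.List;

public class WordSeparatorSketch {

    // Hypothetical stand-in for a TextPosition: one glyph, its x and its width.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Assumed gap tolerance for this sketch; not the value PDFBox computes.
    static final float GAP_TOLERANCE = 1.0f;

    // The glyphs from the report, in the order they appear in the content stream.
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("1", 575.36f, 4.447998f));
        glyphs.add(new Glyph("1", 579.752f, 4.447998f));
        glyphs.add(new Glyph("1", 526.2f, 4.447998f));
        glyphs.add(new Glyph("0", 530.59204f, 4.447998f));
        return glyphs;
    }

    // Joins glyphs in stream order, inserting a space where a word break is detected.
    static String join(List<Glyph> glyphs, boolean alsoBreakOnBackwardJump) {
        StringBuilder sb = new StringBuilder();
        Glyph last = null;
        for (Glyph g : glyphs) {
            if (last != null) {
                float expectedStartOfNextWordX = last.x + last.width + GAP_TOLERANCE;
                // Same shape as the quoted check: fires only when the glyph lies to the right.
                boolean forwardGap = expectedStartOfNextWordX < g.x;
                // Hypothetical extra check: the glyph ends well to the left of the previous one.
                boolean backwardJump = g.x + g.width < last.x - GAP_TOLERANCE;
                if (forwardGap || (alsoBreakOnBackwardJump && backwardJump)) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            last = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(sample(), false)); // prints "1110"
        System.out.println(join(sample(), true));  // prints "11 10"
    }
}
```

With the forward-only check, the jump from x=579.752 back to x=526.2 is never
seen as a word boundary, so the two chunks are concatenated in stream order.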

I guess this could also be framed as an RTL problem, but the PDF is a plain PDF;
it just seems that Oracle Reports generates these chunks in reverse order.



