[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

Emilian Bold (JIRA) Tue, 11 Sep 2018 00:59:09 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610232#comment-16610232
 ]


Emilian Bold commented on PDFBOX-4313:
--------------------------------------

>  I can't do any changes without a test PDF. There is more than just what you 
>posted.

You shouldn't need a PDF to make a unit test that fails for PDFTextStripper.

The bug is obvious, you do ' && expectedStartOfNextWordX < positionX' so you 
assume the next character will have an increasing X coord. If you get a 
character with X before the start of the current word, you will still append 
it. So not only there's a bug with regard to detecting there are separate 
words, by appending text that's before the start of the current word, the 
ordering is wrong too!

setSortByPosition(true) does not fix this as shown above.

> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
>                 Key: PDFBOX-4313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.11
>            Reporter: Emilian Bold
>            Priority: Major
>         Attachments: crop-fisa-sintetica.png
>
>
> I have the text "10" and "11" and they get merged into to "1110" word.
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in in this PDFTextStripper chunk:
> {{
>                    // test if our TextPosition starts after a new word would 
> be expected to start
>                     if (expectedStartOfNextWordX != 
> EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>                             && expectedStartOfNextWordX < positionX &&
>                             // only bother adding a space if the last 
> character was not a space
>                             lastPosition.getTextPosition().getUnicode() != 
> null
>                             && 
> !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>                     {
>                         line.add(LineItem.getWordSeparator());
>                     }
> }}
> which seems to add a word separator only if the next char is "after" the 
> current word. It never expects that the next char might be "before" the 
> current word.
> I guess this could also be framed as a RTL problem, but the PDF is a plain 
> PDF, it just seems that Oracle Reports generates these chunks in the reverse 
> order.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

Reply via email to