[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

Emilian Bold (JIRA) Sat, 08 Sep 2018 02:39:26 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607982#comment-16607982
 ]


Emilian Bold commented on PDFBOX-4313:
--------------------------------------

setSortByPosition(true) makes some errors go away but introduces new 
ones:

Direction switch for `Obligatie de plataDocument`
split 1 > Obligatie de plata 267.88 x 111.600006 Obligatie de plata@ 267.88 x 
111.600006[, `O` @ 267.88 x 111.600006, `b` @ 274.048 x 111.600006, `l` @ 
278.44 x 111.600006, `i` @ 280.224 x 111.600006, `g` @ 282.008 x 111.600006, 
`a` @ 286.4 x 111.600006, `t` @ 290.792 x 111.600006, `i` @ 292.98398 x 
111.600006, `e` @ 294.76797 x 111.600006, ` ` @ 299.15997 x 111.600006, ` ` @ 
301.35196 x 111.600006, `d` @ 303.54395 x 111.600006, `e` @ 307.93594 x 
111.600006, ` ` @ 312.32794 x 111.600006, `p` @ 314.51993 x 111.600006, `l` @ 
318.91193 x 111.600006, `a` @ 320.69592 x 111.600006, `t` @ 325.08792 x 
111.600006, `a` @ 327.2799 x 111.600006]
split 2 > Document 72.84 x 117.79999 Document@ 72.84 x 117.79999[, `D` @ 72.84 
x 117.79999, `o` @ 78.6 x 117.79999, `c` @ 82.992 x 117.79999, `u` @ 86.967995 
x 117.79999, `m` @ 91.35999 x 117.79999, `e` @ 98.079994 x 117.79999, `n` @ 
102.47199 x 117.79999, `t` @ 106.86399 x 117.79999]

These two chunks that get mushed together after setSortByPosition(true) are 
part of this table header: 

!crop-fisa-sintetica.png!

 

To me, the bug still seems to be in PDFTextStripper: if it requires a 
specific ordering, it should sort the items itself.

I cannot attach the PDF as it contains financial records. Oddly enough, even 
cropping the PDF (with macOS Preview) seems to preserve some confidential text 
that's outside the bounds of the crop.

> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
>                 Key: PDFBOX-4313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.11
>            Reporter: Emilian Bold
>            Priority: Major
>         Attachments: crop-fisa-sintetica.png
>
>
> I have the texts "10" and "11", and they get merged into the single word "1110".
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
>                     // test if our TextPosition starts after a new word would be expected to start
>                     if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>                             && expectedStartOfNextWordX < positionX &&
>                             // only bother adding a space if the last character was not a space
>                             lastPosition.getTextPosition().getUnicode() != null
>                             && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>                     {
>                         line.add(LineItem.getWordSeparator());
>                     }
> }}
> which seems to add a word separator only if the next char is "after" the 
> current word. It never expects that the next char might be "before" the 
> current word.
> I guess this could also be framed as an RTL problem, but the PDF is a plain 
> PDF; it just seems that Oracle Reports generates these chunks in reverse 
> order.
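
The forward-only check quoted above can be illustrated with a small standalone sketch that uses the four glyph coordinates from the issue description. This is not PDFBox code: the Glyph class, the join method, the half-glyph-width gap threshold and the detectBackwardJump flag are all hypothetical names and assumptions made for illustration. The real PDFTextStripper computes expectedStartOfNextWordX with its own heuristics.

```java
import java.util.Arrays;
import java.util.List;

class WordSeparatorSketch {

    /** Hypothetical stand-in for a positioned glyph; not a PDFBox class. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Glyphs in the order listed in the issue description ("11" first, then "10"). */
    static List<Glyph> issueGlyphs() {
        return Arrays.asList(
                new Glyph("1", 575.36f, 4.447998f),
                new Glyph("1", 579.752f, 4.447998f),
                new Glyph("1", 526.2f, 4.447998f),
                new Glyph("0", 530.59204f, 4.447998f));
    }

    /**
     * Joins glyphs into a line, inserting a space where a new word seems to start.
     * With detectBackwardJump == false only a forward gap triggers a separator,
     * which mirrors the shape of the quoted check.
     */
    static String join(List<Glyph> glyphs, boolean detectBackwardJump) {
        StringBuilder line = new StringBuilder();
        Glyph last = null;
        for (Glyph g : glyphs) {
            if (last != null && !last.text.endsWith(" ")) {
                // Assumption: a new word starts once the gap exceeds half a glyph width.
                float expectedStartOfNextWordX = last.x + last.width + 0.5f * last.width;
                boolean startsAfter = expectedStartOfNextWordX < g.x;
                // Hypothetical extra test: the glyph ends before the previous one begins.
                boolean startsBefore = detectBackwardJump && g.x + g.width <= last.x;
                if (startsAfter || startsBefore) {
                    line.append(' ');
                }
            }
            line.append(g.text);
            last = g;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(issueGlyphs(), false)); // prints 1110
        System.out.println(join(issueGlyphs(), true));  // prints 11 10
    }
}
```

With the forward-only test the jump from x = 579.752 back to x = 526.2 never produces a separator, so the two numbers fuse into "1110". Treating a backward jump as a word boundary keeps them apart in this sketch; whether that is the right fix inside PDFTextStripper is a separate question.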



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)