[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

JIRA Sun, 23 Sep 2018 04:52:10 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625081#comment-16625081
 ]


Andreas Lehmkühler commented on PDFBOX-4313:
--------------------------------------------

A line break is triggered only if the last and the current text position 
don't overlap vertically at all. The given case is a corner case.

This is the relevant code from PDFTextStripper:
{code}
private boolean overlap(float y1, float height1, float y2, float height2)
{
    return within(y1, y2, .1f) || y2 <= y1 && y2 >= y1 - height1
            || y1 <= y2 && y1 >= y2 - height2;
}
{code}
These are the relevant text positions from DrawPrintTextLocations:
{code}
String[714.886,293.3178 fs=6.0 xscale=6.0 height=3.468 space=1.6680002 width=1.3319702]l
String[20.0,297.63782 fs=6.0 xscale=6.0 height=3.468 space=1.6680002 width=4.3320007]D

293.3178 <= 297.63782 && 293.3178 >= 297.63782 - 3.468 = 293.16982 -> leads to 
"true" and doesn't detect the line break
{code}
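
For anyone who wants to try the check outside of PDFBox, the following is a 
minimal standalone sketch of it. The class name OverlapSketch and the sample 
values in main are made up for illustration and are not taken from the 
attached PDF. within() is written on the assumption that it tests whether the 
second value lies inside first +/- variance, which may differ in detail from 
the helper in PDFTextStripper.

```java
final class OverlapSketch
{
    // Assumed helper: true if second lies strictly inside first +/- variance.
    static boolean within(float first, float second, float variance)
    {
        return second > first - variance && second < first + variance;
    }

    // Same expression as the overlap() method quoted above.
    static boolean overlap(float y1, float height1, float y2, float height2)
    {
        return within(y1, y2, .1f) || y2 <= y1 && y2 >= y1 - height1
                || y1 <= y2 && y1 >= y2 - height2;
    }

    public static void main(String[] args)
    {
        // Hypothetical values: y2 is 9.5 units from y1, height is 10,
        // so y2 still falls inside [y1 - height1, y1].
        System.out.println(overlap(100f, 10f, 90.5f, 10f));

        // Hypothetical values: y2 is 12 units from y1, which is more than
        // the height, so none of the three conditions holds.
        System.out.println(overlap(100f, 10f, 88f, 10f));
    }
}
```

The first call shows how a position that only just touches the other one's 
height range is still treated as being on the same line.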

I've experimented with some threshold values so that a marginal overlap no 
longer prevents a line break. The version below uses 90% of the given height 
values:
{code}
private boolean overlap(float y1, float height1, float y2, float height2)
{
    return within(y1, y2, .1f)
            || (y2 <= y1 && y1 - height1 - y2 < -(height1 * 0.1f))
            || (y1 <= y2 && y2 - height2 - y1 < -(height2 * 0.1f));
}
{code}
Could this be a reasonable solution? Instead of using a fixed threshold we 
could introduce another parameter to change that value from the outside.
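
A rough sketch of what such a parameter might look like is given here. The 
class name TunableOverlapSketch and the field overlapFraction are 
hypothetical and not part of PDFBox, and within() is again an assumed 
stand-in for the existing helper. The condition is written as 
"y1 - y2 < height1 * overlapFraction", which for a fraction of 0.9 is 
algebraically the same as the 90% expression (up to float rounding).

```java
final class TunableOverlapSketch
{
    // Hypothetical setting: share of the height that still counts as overlap.
    private final float overlapFraction;

    TunableOverlapSketch(float overlapFraction)
    {
        if (overlapFraction <= 0f || overlapFraction > 1f)
        {
            throw new IllegalArgumentException("overlapFraction must be in (0, 1]");
        }
        this.overlapFraction = overlapFraction;
    }

    // Assumed helper: true if second lies strictly inside first +/- variance.
    static boolean within(float first, float second, float variance)
    {
        return second > first - variance && second < first + variance;
    }

    boolean overlap(float y1, float height1, float y2, float height2)
    {
        return within(y1, y2, .1f)
                || (y2 <= y1 && y1 - y2 < height1 * overlapFraction)
                || (y1 <= y2 && y2 - y1 < height2 * overlapFraction);
    }

    public static void main(String[] args)
    {
        TunableOverlapSketch ninetyPercent = new TunableOverlapSketch(0.9f);
        TunableOverlapSketch fullHeight = new TunableOverlapSketch(1.0f);

        // Made-up values: the positions are 9.5 units apart, height is 10.
        System.out.println(ninetyPercent.overlap(100f, 10f, 90.5f, 10f));
        System.out.println(fullHeight.overlap(100f, 10f, 90.5f, 10f));
    }
}
```

Keeping the fraction as a constructor argument is only one option; a setter 
with a default value would leave existing callers unaffected.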



> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
>                 Key: PDFBOX-4313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.11
>            Reporter: Emilian Bold
>            Priority: Major
>         Attachments: 1536938716546.pdf, PDFBOX-4313-Test.pdf, 
> PDFBOX-4313-Test_sorted.txt, PDFBOX-4313-Test_unsorted.txt, PDFBOX-4313.pdf, 
> PDFBOX4313Test.java, PDFBOX4313Test.java, crop-fisa-sintetica.png, 
> pdfbox-words.png
>
>
> I have the text "10" and "11" and they get merged into the word "1110".
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
>                     // test if our TextPosition starts after a new word would be expected to start
>                     if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>                             && expectedStartOfNextWordX < positionX &&
>                             // only bother adding a space if the last character was not a space
>                             lastPosition.getTextPosition().getUnicode() != null
>                             && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>                     {
>                         line.add(LineItem.getWordSeparator());
>                     }
> }}
> which seems to add a word separator only if the next char is "after" the 
> current word. It never expects that the next char might be "before" the 
> current word.
> I guess this could also be framed as a RTL problem, but the PDF is a plain 
> PDF, it just seems that Oracle Reports generates these chunks in the reverse 
> order.


