[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625081#comment-16625081 ]
Andreas Lehmkühler commented on PDFBOX-4313:
--------------------------------------------

Line breaks are triggered only if the last and the current text position don't overlap at all. The given case is a corner case. This is the relevant code from PDFTextStripper:

{code}
private boolean overlap(float y1, float height1, float y2, float height2)
{
    return within(y1, y2, .1f) || y2 <= y1 && y2 >= y1 - height1
            || y1 <= y2 && y1 >= y2 - height2;
}
{code}

These are the relevant text positions from DrawPrintTextLocations:

{code}
String[714.886,293.3178 fs=6.0 xscale=6.0 height=3.468 space=1.6680002 width=1.3319702]l
String[20.0,297.63782 fs=6.0 xscale=6.0 height=3.468 space=1.6680002 width=4.3320007]D

293.3178 <= 297.63782 && 293.3178 >= 297.63782 - 3.468 = 293.16982
-> leads to "true" and doesn't detect the line break
{code}

I've experimented with some threshold values to make the overlap detection a little bit more lenient. I've used 90% of the given height values.

{code}
private boolean overlap(float y1, float height1, float y2, float height2)
{
    return within(y1, y2, .1f)
            || (y2 <= y1 && y1 - height1 - y2 < - (height1 * 0.1f))
            || (y1 <= y2 && y2 - height2 - y1 < - (height2 * 0.1f));
}
{code}

Could this be a reasonable solution? Instead of using a fixed threshold, we could introduce another parameter so that the value can be changed from the outside.

> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
>                 Key: PDFBOX-4313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.11
>            Reporter: Emilian Bold
>            Priority: Major
>         Attachments: 1536938716546.pdf, PDFBOX-4313-Test.pdf,
> PDFBOX-4313-Test_sorted.txt, PDFBOX-4313-Test_unsorted.txt, PDFBOX-4313.pdf,
> PDFBOX4313Test.java, PDFBOX4313Test.java, crop-fisa-sintetica.png,
> pdfbox-words.png
>
>
> I have the text "10" and "11" and they get merged into the word "1110".
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
>     // test if our TextPosition starts after a new word would be expected to start
>     if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>             && expectedStartOfNextWordX < positionX &&
>             // only bother adding a space if the last character was not a space
>             lastPosition.getTextPosition().getUnicode() != null
>             && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>     {
>         line.add(LineItem.getWordSeparator());
>     }
> }}
> which seems to add a word separator only if the next char is "after" the
> current word. It never expects that the next char might be "before" the
> current word.
> I guess this could also be framed as an RTL problem, but the PDF is a plain
> PDF; it just seems that Oracle Reports generates these chunks in the reverse
> order.
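
To make the effect of the 90% variant from the comment easier to see, here is a minimal, self-contained sketch that puts the quoted check next to the proposed one and exposes the shrink factor as a parameter. The class name OverlapSketch, the parameter name shrink and the sample coordinates are hypothetical and are not taken from PDFBox or from the attached PDFs. within is re-implemented here as a plain tolerance check and is only assumed to behave like the private helper in PDFTextStripper.

```java
// Illustrative sketch only; this is not the PDFBox implementation.
final class OverlapSketch
{
    // Assumed stand-in for PDFTextStripper's private helper:
    // true if second lies strictly within +/- variance of first.
    static boolean within(float first, float second, float variance)
    {
        return second < first + variance && second > first - variance;
    }

    // The overlap check as quoted in the comment.
    static boolean overlapCurrent(float y1, float height1, float y2, float height2)
    {
        return within(y1, y2, .1f)
                || (y2 <= y1 && y2 >= y1 - height1)
                || (y1 <= y2 && y1 >= y2 - height2);
    }

    // The proposed variant, with the fixed 0.1f replaced by a parameter.
    // "shrink" is a hypothetical name; 0.1f means 90% of the height is used.
    static boolean overlapShrunk(float y1, float height1, float y2, float height2,
            float shrink)
    {
        return within(y1, y2, .1f)
                || (y2 <= y1 && y1 - height1 - y2 < -(height1 * shrink))
                || (y1 <= y2 && y2 - height2 - y1 < -(height2 * shrink));
    }

    public static void main(String[] args)
    {
        // Hypothetical positions: height 10, y values 9.5 apart (barely touching).
        System.out.println(overlapCurrent(100f, 10f, 109.5f, 10f));
        System.out.println(overlapShrunk(100f, 10f, 109.5f, 10f, 0.1f));
        // Hypothetical positions: height 10, y values 2 apart (clearly same line).
        System.out.println(overlapShrunk(100f, 10f, 102f, 10f, 0.1f));
    }
}
```

With these made-up values the quoted check reports an overlap for the barely touching pair, while the 0.1 variant does not and would therefore allow a line break. The pair that clearly shares a line still overlaps. How such a parameter would be exposed on PDFTextStripper is left open here.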