[ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607993#comment-16607993 ]
Tilman Hausherr commented on PDFBOX-4313:
-----------------------------------------

The cropping may only have changed the cropbox rectangle. I can't make any changes without a test PDF. There is more to it than just what you posted. I need the PDF or a reduced version of it. A reduced version may be possible if you create a decoded file first (command line utility "WriteDecodedDoc"), and then change the content stream with an editor. Of course, for that you'd need to know a bit about the content stream operators etc.

Alternatively, change the source code in the way you think is needed and then run the build tests. If they pass without errors, or with only improvements, please tell me what you did and I'll run additional tests with files that are not in the repository for copyright reasons.

> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
>                 Key: PDFBOX-4313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.11
>            Reporter: Emilian Bold
>            Priority: Major
>         Attachments: crop-fisa-sintetica.png
>
>
> I have the text "10" and "11" and they get merged into the word "1110".
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
>     // test if our TextPosition starts after a new word would be expected to start
>     if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>             && expectedStartOfNextWordX < positionX &&
>             // only bother adding a space if the last character was not a space
>             lastPosition.getTextPosition().getUnicode() != null
>             && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>     {
>         line.add(LineItem.getWordSeparator());
>     }
> }}
> which seems to add a word separator only if the next char is "after" the current word.
> It never expects that the next char might be "before" the current word.
> I guess this could also be framed as an RTL problem, but the PDF is a plain PDF; it just seems that Oracle Reports generates these chunks in reverse order.
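
The behavior described in the report can be reproduced outside PDFBox with a small standalone sketch. It is not PDFBox code: the `Glyph` class, the `join` method, and the fixed `GAP` threshold are hypothetical stand-ins (PDFBox derives its expected word start from font metrics, which are not shown in the report). The sketch feeds in the four reported glyphs in content-stream order and compares a forward-only check, modeled on the quoted condition, with an assumed variant that also looks for a gap to the left.

```java
import java.util.ArrayList;
import java.util.List;

public class WordGapSketch {
    // Hypothetical stand-in for a glyph position; not PDFBox's TextPosition.
    public static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Assumed fixed gap threshold, chosen only for this illustration.
    static final float GAP = 2.0f;

    // Joins glyphs in the given order, inserting a space where a word break is detected.
    public static String join(List<Glyph> glyphs, boolean checkBackward) {
        StringBuilder sb = new StringBuilder();
        Glyph last = null;
        for (Glyph g : glyphs) {
            if (last != null) {
                // Forward check, modeled on the quoted condition.
                float expectedStartOfNextWordX = last.x + last.width + GAP;
                boolean startsAfter = expectedStartOfNextWordX < g.x;
                // Assumed extra check: the glyph ends well before the previous one starts.
                boolean endsBefore = checkBackward && g.x + g.width + GAP < last.x;
                if (startsAfter || endsBefore) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            last = g;
        }
        return sb.toString();
    }

    // The four glyphs from the report, in the order they were listed.
    public static List<Glyph> reportedGlyphs() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("1", 575.36f, 4.447998f));
        glyphs.add(new Glyph("1", 579.752f, 4.447998f));
        glyphs.add(new Glyph("1", 526.2f, 4.447998f));
        glyphs.add(new Glyph("0", 530.59204f, 4.447998f));
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(join(reportedGlyphs(), false)); // prints 1110
        System.out.println(join(reportedGlyphs(), true));  // prints 11 10
    }
}
```

With the forward-only check, the jump from x = 579.752 back to x = 526.2 never satisfies `expectedStartOfNextWordX < positionX`, so no separator is added. Whether a backward check like the one above is the right change for PDFTextStripper would still need to be confirmed against the build tests and a test PDF.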