[jira] [Issue Comment Deleted] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

Paul Slootweg (Jira) Thu, 22 Aug 2019 07:24:32 -0700


     [ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paul Slootweg updated PDFBOX-4313:
----------------------------------
    Comment: was deleted

(was: I am currently seeing a similar problem - in this case a line of bold 
text has a line of standard text below it and places the second line as part of 
the first.

This looks to be because it is using the bold font height to compare the 
overlap for the standard line.

Unfortunately at this point I can't provide a failing PDF, but I will try to 
see if I can.)

> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
>                 Key: PDFBOX-4313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.11
>            Reporter: Emilian Bold
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>         Attachments: 1536938716546.pdf, PDFBOX-4313-Test.pdf, 
> PDFBOX-4313-Test_sorted.txt, PDFBOX-4313-Test_unsorted.txt, PDFBOX-4313.pdf, 
> PDFBOX4313Test.java, PDFBOX4313Test.java, crop-fisa-sintetica.png, 
> pdfbox-words.png
>
>
> I have the text "10" and "11" and they get merged into to "1110" word.
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in in this PDFTextStripper chunk:
> {{
>                    // test if our TextPosition starts after a new word would 
> be expected to start
>                     if (expectedStartOfNextWordX != 
> EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>                             && expectedStartOfNextWordX < positionX &&
>                             // only bother adding a space if the last 
> character was not a space
>                             lastPosition.getTextPosition().getUnicode() != 
> null
>                             && 
> !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>                     {
>                         line.add(LineItem.getWordSeparator());
>                     }
> }}
> which seems to add a word separator only if the next char is "after" the 
> current word. It never expects that the next char might be "before" the 
> current word.
> I guess this could also be framed as a RTL problem, but the PDF is a plain 
> PDF, it just seems that Oracle Reports generates these chunks in the reverse 
> order.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Issue Comment Deleted] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

Reply via email to