[jira] [Commented] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words

JIRA Sun, 23 Sep 2018 03:17:24 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625055#comment-16625055
 ]


Andreas Lehmkühler commented on PDFBOX-4313:
--------------------------------------------

I've attached the PDF produced by the given test, together with both text 
extraction results (sorted and unsorted) from the 2.0 branch. The unsorted 
result isn't useful, because the text is stored unsorted in the PDF. The 
sorted result doesn't show any issues with the number values in the second 
row. The column headers are difficult, but the result is as good/bad as 
expected, with one exception: there seems to be a missing line break after 
"Modul".
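
To illustrate why the sorted and unsorted results differ, here is a minimal 
sketch that does not use PDFBox. Glyph and SortedJoinSketch are hypothetical 
names, and the coordinates are the ones quoted in the issue description. It 
only mimics the idea of ordering glyphs by position before joining them; in 
PDFBox itself the switch is PDFTextStripper.setSortByPosition(true), and the 
real sorting is more involved than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned glyph; this is not PDFBox's TextPosition.
final class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class SortedJoinSketch {
    // The four glyphs from the issue description, in content stream order.
    static List<Glyph> issueGlyphs() {
        return Arrays.asList(
                new Glyph("1", 575.36f, 227.4f),
                new Glyph("1", 579.752f, 227.4f),
                new Glyph("1", 526.2f, 227.4f),
                new Glyph("0", 530.59204f, 227.4f));
    }

    // Concatenates the glyph texts in the order given.
    static String join(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    // Returns a copy ordered by y, then by x. All glyphs here share one y,
    // so the direction of the y axis does not matter for this example.
    static List<Glyph> sortedByPosition(List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x));
        return copy;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = issueGlyphs();
        System.out.println(join(glyphs));                   // prints 1110
        System.out.println(join(sortedByPosition(glyphs))); // prints 1011
    }
}
```

The sketch inserts no word separators at all; it only shows that the order of 
the digits depends on whether the glyphs are sorted first.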

> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
>                 Key: PDFBOX-4313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.11
>            Reporter: Emilian Bold
>            Priority: Major
>         Attachments: 1536938716546.pdf, PDFBOX-4313-Test.pdf, 
> PDFBOX-4313-Test_sorted.txt, PDFBOX-4313-Test_unsorted.txt, PDFBOX-4313.pdf, 
> PDFBOX4313Test.java, PDFBOX4313Test.java, crop-fisa-sintetica.png, 
> pdfbox-words.png
>
>
> I have the texts "10" and "11", and they get merged into the single word "1110".
> The coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
>     // test if our TextPosition starts after a new word would be expected to start
>     if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>             && expectedStartOfNextWordX < positionX &&
>             // only bother adding a space if the last character was not a space
>             lastPosition.getTextPosition().getUnicode() != null
>             && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>     {
>         line.add(LineItem.getWordSeparator());
>     }
> }}
> which seems to add a word separator only if the next char is "after" the 
> current word. It never expects that the next char might be "before" the 
> current word.
> I guess this could also be framed as an RTL problem, but the PDF is a plain 
> PDF; it just seems that Oracle Reports generates these chunks in reverse 
> order.
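
The reporter's point can be sketched as a separator check that also fires when 
the next glyph starts before the previous one. This is a standalone 
illustration, not a patch: WordGapSketch, needsSeparator, joinWithSeparators 
and the fixed tolerance are all hypothetical, and PDFTextStripper derives its 
expected start of the next word differently. The sketch also ignores the 
"last character was a space" condition and the y coordinate.

```java
final class WordGapSketch {
    // Hypothetical check: true when a word separator belongs between the
    // previous glyph and the next one.
    static boolean needsSeparator(float prevX, float prevWidth, float nextX, float tolerance) {
        // Forward gap: the only case the quoted check covers.
        boolean startsAfter = nextX > prevX + prevWidth + tolerance;
        // Backward jump: the next glyph starts left of the previous glyph.
        boolean startsBefore = nextX < prevX - tolerance;
        return startsAfter || startsBefore;
    }

    // Joins glyph texts in the given order, adding a space wherever the check fires.
    static String joinWithSeparators(String[] texts, float[] xs, float[] widths, float tolerance) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.length; i++) {
            if (i > 0 && needsSeparator(xs[i - 1], widths[i - 1], xs[i], tolerance)) {
                sb.append(' ');
            }
            sb.append(texts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Glyphs from the issue description, in content stream order.
        String[] texts = {"1", "1", "1", "0"};
        float[] xs = {575.36f, 579.752f, 526.2f, 530.59204f};
        float[] widths = {4.447998f, 4.447998f, 4.447998f, 4.447998f};
        // A tolerance of 1.0 is an arbitrary value chosen for this example.
        System.out.println(joinWithSeparators(texts, xs, widths, 1.0f)); // prints 11 10
    }
}
```

With the backward case included, the two chunks stay separate ("11 10") even 
in content stream order; with only the forward check they would run together 
as "1110".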



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
