[ https://issues.apache.org/jira/browse/PDFBOX-4101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354990#comment-16354990 ]

Tilman Hausherr commented on PDFBOX-4101:
-----------------------------------------

There is no fixed rule that sorted mode is better than unsorted mode. 
Sometimes unsorted mode is better, e.g. if a multi-column PDF was created with 
the text already in perfect reading order. (Open your file with Adobe Reader 
and try to select only the three lines of the left column on page 2: you 
can't, because it selects three other segments as well.) Sorted mode is better 
if you want the extracted text to follow its position on the page. However, 
the sort has no proper transitivity rule when glyphs have different sizes 
(see PDFBOX-1512).
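
To illustrate the transitivity problem, here is a minimal sketch in plain 
Java. It is NOT the comparator PDFBox uses; the Glyph class, the "same line" 
test (vertical ranges overlap) and the coordinates are all assumptions made up 
for this example. It only shows how a rule of the form "same line: compare x, 
otherwise: compare y" can produce a cycle once one glyph is much taller than 
its neighbours.

```java
import java.util.Comparator;

public class SortTransitivitySketch {

    /** Hypothetical glyph box: left edge x, top edge y (y grows downward), height. */
    static final class Glyph {
        final String name;
        final double x;
        final double y;
        final double height;

        Glyph(String name, double x, double y, double height) {
            this.name = name;
            this.x = x;
            this.y = y;
            this.height = height;
        }

        double bottom() {
            return y + height;
        }
    }

    /**
     * Simplified position rule (an assumption, not PDFBox code):
     * glyphs whose vertical ranges overlap count as being on the same line
     * and are ordered left to right; all others are ordered top to bottom.
     */
    static final Comparator<Glyph> BY_POSITION = (a, b) -> {
        boolean sameLine = a.y < b.bottom() && b.y < a.bottom();
        return sameLine ? Double.compare(a.x, b.x) : Double.compare(a.y, b.y);
    };

    static boolean before(Glyph a, Glyph b) {
        return BY_POSITION.compare(a, b) < 0;
    }

    /** True if the rule says a < c, c < b and b < a all at once. */
    static boolean hasCycle(Glyph a, Glyph b, Glyph c) {
        return before(a, c) && before(c, b) && before(b, a);
    }

    public static void main(String[] args) {
        Glyph a = new Glyph("A", 30, 0, 10);  // small glyph on the first line
        Glyph b = new Glyph("B", 20, 8, 30);  // tall glyph spanning both lines
        Glyph c = new Glyph("C", 10, 20, 10); // small glyph on the second line

        System.out.println("B before A: " + before(b, a));
        System.out.println("C before B: " + before(c, b));
        System.out.println("A before C: " + before(a, c));
        System.out.println("cycle: " + hasCycle(a, b, c));
    }
}
```

With these made-up boxes the tall glyph overlaps both small ones, so the rule 
says C < B and B < A, yet A < C because A and C do not overlap. No total order 
satisfies all three, so the result of a sort depends on the order in which the 
pairs happen to be compared.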

> Word ordering / line detection failures in text extraction
> ----------------------------------------------------------
>
>                 Key: PDFBOX-4101
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4101
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.8
>            Reporter: Alexandre
>            Priority: Major
>         Attachments: fails_line_detection-sort.txt, 
> fails_line_detection-unsort.txt, fails_line_detection.pdf, hardtests-11.png
>
>
> Dear Apache contributors,
> I am a happy user of pdfbox, mainly for the purpose of text extraction. The 
> word ordering is not correct in some cases, and line detection may fail too.
> Attachments:
>  * 1st page: the first letter "D" is not output before "uis sit amet..." but 
> at the end of the page;
>  * 2nd page: the phrase "scolaire ferry" is output just before "réouverture 
> du musée", which is wrong because they are not in the same column.
> Handling these cases would be more than welcome :D A.



