[ 
https://issues.apache.org/jira/browse/PDFBOX-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149430#comment-17149430
 ] 

Maruan Sahyoun commented on PDFBOX-4904:
----------------------------------------

Sorting is not the default option because

a) it has a (small) performance impact, and

b) more importantly, it does not work well for multi-column layouts and tables.

PDF offers a way to mark multi-column text or articles using so-called beads, 
but that is not commonly used. PDFs can also carry structure information, 
which again is not commonly used. So there is room for improvement when it 
comes to text extraction for complex layouts.
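
To illustrate point b), here is a minimal sketch in plain Java. It does not use 
PDFBox and is not PDFBox's actual sorting algorithm; the Fragment class, the 
coordinates and the sample strings are hypothetical stand-ins for positioned 
text runs. In PDFBox itself, sorting is switched on with 
PDFTextStripper.setSortByPosition(true). The sketch only shows why a naive 
"top to bottom, then left to right" ordering can repair a run that was drawn 
out of reading order, yet interleave the lines of a two-column page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortByPositionSketch {

    // Hypothetical stand-in for a positioned run of text; not a PDFBox class.
    static final class Fragment {
        final String text;
        final int x; // distance from the left page edge
        final int y; // distance from the top of the page

        Fragment(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins the fragments in the order given (the "unsorted" extraction).
    static String join(List<Fragment> fragments) {
        return fragments.stream().map(f -> f.text).collect(Collectors.joining(" "));
    }

    // Naive position sort: top to bottom, then left to right.
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.comparingInt((Fragment f) -> f.y)
                            .thenComparingInt(f -> f.x));
        return join(copy);
    }

    // One line whose first run (at x=50) is drawn after the rest of the line.
    static List<Fragment> lateRunLine() {
        List<Fragment> line = new ArrayList<>();
        line.add(new Fragment("rest-of-line", 200, 100));
        line.add(new Fragment("first-run", 50, 100));
        return line;
    }

    // Two columns, drawn column by column (left column first).
    static List<Fragment> twoColumnPage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("L1", 50, 100));
        page.add(new Fragment("L2", 50, 120));
        page.add(new Fragment("R1", 300, 100));
        page.add(new Fragment("R2", 300, 120));
        return page;
    }

    public static void main(String[] args) {
        System.out.println(join(lateRunLine()));               // rest-of-line first-run
        System.out.println(sortedByPosition(lateRunLine()));   // first-run rest-of-line
        System.out.println(join(twoColumnPage()));             // L1 L2 R1 R2
        System.out.println(sortedByPosition(twoColumnPage())); // L1 R1 L2 R2
    }
}
```

In the first case sorting restores the reading order; in the second it mixes 
the two columns line by line, which is why sorting is left as an opt-in.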

> Bold text leads to wrong order - Text extraction
> ------------------------------------------------
>
>                 Key: PDFBOX-4904
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4904
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, PDModel
>    Affects Versions: 2.0.20
>         Environment: JDK 8
>            Reporter: Ronald Bergmann
>            Assignee: Maruan Sahyoun
>            Priority: Minor
>              Labels: Documentation
>             Fix For: 2.0.21, 3.0.0 PDFBox
>
>         Attachments: 152-0130-20-B-Ö-43.pdf, 152-0130-20-B-Ö-43.txt, Two 
> columns 1.pdf
>
>
> When extracting the text from a PDF, bold text seems to be out of order under 
> some conditions.
>  
> {code:java}
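> import java.io.File;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> try (PDDocument doc = PDDocument.load(new File("152-0130-20-B-Ö-43.pdf"))) {
>     // Default settings: sorting by position is off.
>     PDFTextStripper stripper = new PDFTextStripper();
>     String contents = stripper.getText(doc);
>     System.out.println(contents);
> }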
> {code}
> See section w) - the text should be:
> _*Präqualifizierte Unternehmen* führen den Nachweis der Eignung durch den 
> Eintrag in_
>  _die Liste des Vereins für die Präqualifikation von Bauunternehmen e.V._
>  _(Präqualifikationsverzeichnis). ..._
> But it actually is:
>  _führen den Nachweis der Eignung durch den Eintrag in *Präqualifizierte 
> Unternehmen*_
>  _die Liste des Vereins für die Präqualifikation von Bauunternehmen e.V._
>  _(Präqualifikationsverzeichnis)._
>  
> I attached an example PDF.
>  
> /edit: pdfjs and Acrobat can copy/paste the text in order.



