[ 
https://issues.apache.org/jira/browse/PDFBOX-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149421#comment-17149421
 ] 

Ronald Bergmann commented on PDFBOX-4904:
-----------------------------------------

That's a good point. Thankfully, tables and sorts are not an issue for this 
particular project of mine, though they are for others.

How come it's not the default option, btw? Don't people expect to get the text 
in the order they see it (in the reader, not in the sources)?
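
To make the question concrete, here is a minimal sketch of what ordering by position means, assuming the option under discussion is the stripper's sort-by-position setting (PDFTextStripper.setSortByPosition(true) in PDFBox 2.0.x). The Fragment class, the coordinates, and the sample strings are hypothetical stand-ins, not PDFBox types or data from the attached PDF, and this is not how PDFBox implements the sort internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {

    // Hypothetical stand-in for a positioned run of text; not a PDFBox class.
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Fragments in the order they might appear in the content stream:
    // the bold run was written after the regular text on the same line.
    static List<Fragment> sampleFragments() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("continue the sentence", 150, 100));
        fragments.add(new Fragment("Bold words", 50, 100));
        fragments.add(new Fragment("on the next line", 50, 112));
        return fragments;
    }

    // Returns a copy ordered top to bottom, then left to right.
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    // Joins the fragment texts with single spaces, in list order.
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> stream = sampleFragments();
        System.out.println("stream order:   " + join(stream));
        System.out.println("position order: " + join(sortByPosition(stream)));
    }
}
```

The sketch compares y coordinates exactly, so it would misorder text whose baselines differ slightly, and it would interleave the rows of side-by-side columns. That may be part of why a plain positional sort is not a safe default, though that is only a guess on my part.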

> Bold text leads to wrong order - Text extraction
> ------------------------------------------------
>
>                 Key: PDFBOX-4904
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4904
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, PDModel
>    Affects Versions: 2.0.20
>         Environment: JDK 8
>            Reporter: Ronald Bergmann
>            Assignee: Maruan Sahyoun
>            Priority: Minor
>              Labels: Documentation
>             Fix For: 2.0.21, 3.0.0 PDFBox
>
>         Attachments: 152-0130-20-B-Ö-43.pdf, 152-0130-20-B-Ö-43.txt, Two 
> columns 1.pdf
>
>
> When extracting the text from a PDF, bold text seems to be out of order under 
> some conditions.
>  
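> {code:java}
> import java.io.File;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.text.PDFTextStripper;
>
> // Extract the text with the stripper's default settings (PDFBox 2.0.x API).
> try (PDDocument doc = PDDocument.load(new File("152-0130-20-B-Ö-43.pdf"))) {
>     PDFTextStripper stripper = new PDFTextStripper();
>     String contents = stripper.getText(doc);
>     System.out.println(contents);
> }
> {code}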
> See section w) - the text should be:
> _*Präqualifizierte Unternehmen* führen den Nachweis der Eignung durch den 
> Eintrag in_
>  _die Liste des Vereins für die Präqualifikation von Bauunternehmen e.V._
>  _(Präqualifikationsverzeichnis). ..._
> But it actually is:
>  _führen den Nachweis der Eignung durch den Eintrag in *Präqualifizierte 
> Unternehmen*_
>  _die Liste des Vereins für die Präqualifikation von Bauunternehmen e.V._
>  _(Präqualifikationsverzeichnis)._
>  
> I attached an example PDF.
>  
> /edit: pdfjs and Acrobat can copy/paste the text in order.


