[
https://issues.apache.org/jira/browse/PDFBOX-4904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149344#comment-17149344
]
Alfred commented on PDFBOX-4904:
--------------------------------
I was under the impression that sortByPosition does not work very well, and I
have tried never to use it, mainly because it breaks extraction from
multi-column PDFs. Here's an example:
[^Two columns 1.pdf]:
I was not aware that there are cases where it works better with the flag than
without it.
I wonder if there's a way to detect when the flag would help and when it would
hinder.
Is sortByPosition expected to work with multi-column layouts?
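
To make the trade-off concrete, here is a minimal, self-contained sketch of why
a naive position sort can both help and hurt. It is a simplified model, not
PDFBox's actual algorithm: the Fragment class, the coordinates and both sample
pages are hypothetical, and the only ordering rule is "top to bottom, then left
to right". In PDFBox itself the flag is toggled with
PDFTextStripper.setSortByPosition(true).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ColumnOrderSketch {

    /** Hypothetical positioned text fragment (not a PDFBox type); y grows downwards. */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Assumed two-column page, listed in content-stream order: left column, then right. */
    static List<Fragment> twoColumnPage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("L1", 50, 100));
        page.add(new Fragment("L2", 50, 120));
        page.add(new Fragment("R1", 300, 100));
        page.add(new Fragment("R2", 300, 120));
        return page;
    }

    /** Assumed single line whose leading (bold) part is drawn after the rest of the line. */
    static List<Fragment> outOfOrderLine() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("rest-of-line", 200, 100));
        page.add(new Fragment("bold-start", 50, 100));
        return page;
    }

    /** Naive position sort: by y first, then by x. Returns a sorted copy. */
    static List<Fragment> sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    /** Joins fragment texts with single spaces. */
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sorting interleaves the two columns line by line.
        System.out.println(join(twoColumnPage()));                    // L1 L2 R1 R2
        System.out.println(join(sortedByPosition(twoColumnPage())));  // L1 R1 L2 R2
        // Sorting restores reading order within a single line.
        System.out.println(join(outOfOrderLine()));                   // rest-of-line bold-start
        System.out.println(join(sortedByPosition(outOfOrderLine()))); // bold-start rest-of-line
    }
}
```

In this toy model the same sort that repairs the out-of-order line also
interleaves the two columns, which is why a single global flag cannot suit
both kinds of document.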
> Bold text leads to wrong order - Text extraction
> ------------------------------------------------
>
> Key: PDFBOX-4904
> URL: https://issues.apache.org/jira/browse/PDFBOX-4904
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing, PDModel
> Affects Versions: 2.0.20
> Environment: JDK 8
> Reporter: Ronald Bergmann
> Assignee: Maruan Sahyoun
> Priority: Minor
> Labels: Documentation
> Fix For: 2.0.21, 3.0.0 PDFBox
>
> Attachments: 152-0130-20-B-Ö-43.pdf, 152-0130-20-B-Ö-43.txt, Two
> columns 1.pdf
>
>
> When extracting the text from a PDF bold text seems to be out of order under
> some conditions.
>
> {code:java}
> try (PDDocument doc = PDDocument.load(new File("152-0130-20-B-Ö-43.pdf"))) {
>     PDFTextStripper stripper = new PDFTextStripper();
>     String contents = stripper.getText(doc);
>     System.out.println(contents);
> }
> {code}
> See section w) - the text should be:
> _*Präqualifizierte Unternehmen* führen den Nachweis der Eignung durch den
> Eintrag in_
> _die Liste des Vereins für die Präqualifikation von Bauunternehmen e.V._
> _(Präqualifikationsverzeichnis). ..._
> But it actually is:
> _führen den Nachweis der Eignung durch den Eintrag in *Präqualifizierte
> Unternehmen*_
> _die Liste des Vereins für die Präqualifikation von Bauunternehmen e.V._
> _(Präqualifikationsverzeichnis)._
>
> I attached an example PDF.
>
> /edit: pdfjs and Acrobat can copy/paste the text in order.