[ https://issues.apache.org/jira/browse/PDFBOX-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857812#comment-15857812 ]
Dominik Bauer commented on PDFBOX-3680:
---------------------------------------

I use Apache Tika 1.13 to extract the text from the PDF. There is a configuration class that lets me set the sortByPosition flag for PDFBox, but the comment on the flag confuses me.

{code:title=PDFParserConfig.java}
// True if we should sort text tokens by position
// (necessary for some PDFs, but messes up other PDFs):
private boolean sortByPosition = false;
{code}

*Does this option in PDFBox really mess up the text?*

> Extracted text in wrong order [header, footer, content]
> -------------------------------------------------------
>
>                 Key: PDFBOX-3680
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3680
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.1
>            Reporter: Dominik Bauer
>         Attachments: 1_to_3_Text.txt, DSG 2000, Fassung vom 27.01.2017.pdf
>
>
> When I extract the text from the attached PDF, the text is in the wrong
> order.
> Every page has a header ("Bundesrecht konsolidiert"), some content, and
> a footer ("www.ris.bka.gv.at Seite x von y"). The footer consists of
> a URL and the page number in German.
> In my view, the extracted text should follow the same order in which we
> read the page. The correct order would be header, content, footer.
> When I open the file in Adobe Reader and copy the text from the page, the
> text is in that same order (header, content, footer).
> The extracted text is:
> {quote}
> Bundesrecht konsolidiert
> www.ris.bka.gv.at Seite 1 von 35
> Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
> {quote}
> Judging by the page layout, the extracted text should be:
> {quote}
> Bundesrecht konsolidiert
> Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
> www.ris.bka.gv.at Seite 1 von 35
> {quote}
> The PDF itself and the extracted text of the first three pages are attached
> to this ticket.
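
To make the question about the flag concrete, here is a minimal sketch of what "sorting text tokens by position" could mean. It is a simplified model in plain Java, not PDFBox's implementation: the `Chunk` class, the method names, and all coordinates are assumptions made up for illustration. In PDFBox itself the switch is `PDFTextStripper.setSortByPosition(true)`, and Tika passes the `sortByPosition` value from `PDFParserConfig` on to it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model (not PDFBox code): text chunks with a page position.
class SortByPositionSketch {

    static final class Chunk {
        final String text;
        final double x; // distance from the left page edge (hypothetical units)
        final double y; // distance from the top page edge (larger = lower on the page)

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Texts in the order the chunks were given, i.e. content-stream order.
    static List<String> inStreamOrder(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    // Texts sorted top-to-bottom, then left-to-right; the input list is not modified.
    static List<String> sortedByPosition(List<Chunk> chunks) {
        List<Chunk> copy = new ArrayList<>(chunks);
        copy.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return inStreamOrder(copy);
    }

    public static void main(String[] args) {
        // Stream order mirrors the report: header, footer, content.
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("Bundesrecht konsolidiert", 70, 40));
        page.add(new Chunk("www.ris.bka.gv.at Seite 1 von 35", 70, 800));
        page.add(new Chunk("Gesamte Rechtsvorschrift [...] und Rechtsnachfolge", 70, 120));

        System.out.println(inStreamOrder(page));
        System.out.println(sortedByPosition(page));
    }
}
```

In this model the position sort restores header, content, footer for a single-column page like the one in the report. The same naive sort interleaves the lines of a two-column page (left line 1, right line 1, left line 2, ...), which is one plausible reading of "messes up other PDFs" in the Tika comment.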