[ 
https://issues.apache.org/jira/browse/PDFBOX-3680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857812#comment-15857812
 ] 

Dominik Bauer commented on PDFBOX-3680:
---------------------------------------

I use Apache Tika 1.13 to extract the text from the PDF. It has a 
configuration class that lets me set the sortByPosition flag for PDFBox, but 
the comment on the flag confuses me.

{code:title=PDFParserConfig.java}
    // True if we should sort text tokens by position
    // (necessary for some PDFs, but messes up other PDFs):
    private boolean sortByPosition = false;
{code}

*Does this option in PDFBox really mess up the text of some PDFs?*
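
To make the question concrete, here is a minimal, self-contained sketch of what "sort by position" could mean. It is a simplified model and not PDFBox's actual algorithm: the class `SortByPositionSketch`, the `Chunk` type, and all coordinates are made up for illustration. It assumes the attached PDF stores header, footer, and content in that order, and it adds a hypothetical two-column page as one plausible case where purely positional sorting could do harm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of position-based sorting; NOT PDFBox's implementation.
class SortByPositionSketch {

    /** A piece of text with made-up page coordinates (y grows downwards). */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the chunks in the order given, one chunk per line. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text).append('\n');
        }
        return sb.toString();
    }

    /** Sorts a copy top-to-bottom, then left-to-right, and joins it. */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    /** Assumed storage order for the reported PDF: header, footer, content. */
    static List<Chunk> headerFooterContentPage() {
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("Bundesrecht konsolidiert", 50, 30));
        page.add(new Chunk("www.ris.bka.gv.at Seite 1 von 35", 50, 800));
        page.add(new Chunk("Gesamte Rechtsvorschrift [...] und Rechtsnachfolge", 50, 100));
        return page;
    }

    /** Hypothetical two-column page stored column by column. */
    static List<Chunk> twoColumnPage() {
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("Left 1", 50, 100));
        page.add(new Chunk("Left 2", 50, 120));
        page.add(new Chunk("Right 1", 300, 100));
        page.add(new Chunk("Right 2", 300, 120));
        return page;
    }

    public static void main(String[] args) {
        System.out.println("Header/footer page, stream order:");
        System.out.print(inStreamOrder(headerFooterContentPage()));
        System.out.println("Header/footer page, sorted by position:");
        System.out.print(sortedByPosition(headerFooterContentPage()));
        System.out.println("Two-column page, stream order:");
        System.out.print(inStreamOrder(twoColumnPage()));
        System.out.println("Two-column page, sorted by position:");
        System.out.print(sortedByPosition(twoColumnPage()));
    }
}
```

In this model, sorting repairs the header/footer page (header, content, footer), but it interleaves the two columns of the second page line by line, although the stream order there was already the reading order. Whether PDFBox behaves this way on real documents is exactly what I would like to know.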

> Extracted text in wrong order [header, footer, content]
> -------------------------------------------------------
>
>                 Key: PDFBOX-3680
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3680
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.1
>            Reporter: Dominik Bauer
>         Attachments: 1_to_3_Text.txt, DSG 2000, Fassung vom 27.01.2017.pdf
>
>
> When I extract the text from the attached PDF, the text comes out in the 
> wrong order. 
> Every page has a header ("Bundesrecht konsolidiert"), some content, and a 
> footer ("www.ris.bka.gv.at Seite x von y"). The footer consists of a URL and 
> the page number, written in German.
> In my view, the extracted text should follow the order in which we read the 
> page. The correct order would be header, content, footer. 
> When I open the file in Adobe Reader and copy the text from the page, the text 
> is also in the same order.
> The extracted text is:
> {quote}
>  Bundesrecht konsolidiert 
> www.ris.bka.gv.at Seite 1 von 35 
> Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
> {quote}
> Judging by how the page looks, the extracted text should be:
> {quote}
>  Bundesrecht konsolidiert 
> Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
> www.ris.bka.gv.at Seite 1 von 35 
> {quote}
> The PDF itself and the extracted text of the first three pages are attached to 
> this ticket.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

