[ 
https://issues.apache.org/jira/browse/TIKA-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295159#comment-14295159
 ] 

Tim Allison commented on TIKA-1533:
-----------------------------------

In the first document, does printed page 303 (PDF page 152) contain Tabell 5.7 
through Tabell 5.9? The only places I see "362" are on printed page 362 and in 
"sammanlagt 362 frågor" on printed page 88 (PDF page 45).

Have you run PDFBox's standalone app with ExtractText to see whether it has 
the same issue as Tika?

> PDF parse failing to capture right order of text (2 columns)
> ------------------------------------------------------------
>
>                 Key: TIKA-1533
>                 URL: https://issues.apache.org/jira/browse/TIKA-1533
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6, 1.7
>         Environment: Java 8, Mac OS X
>            Reporter: Tamara
>
> When I convert a document with two columns, the order of the columns is 
> inverted in the text file. I only noticed because the page is an index list. 
> In the first file the problem starts on page 303; to find it in the converted 
> text, search for "362". In the second file I have the same problem on page 341.
> I tried setSortByPosition(true), and the columns got scrambled.
> Copying and pasting from the PDF preview gives the text in the correct 
> order.
> PDFXStream also parses it in the correct order.
> Here are the files where I have seen the issue:
> http://www.sbu.se/upload/Publikationer/Content0/1/Autismspektrumtillst%C3%A5nd_fulltext.pdf
> http://www.sbu.se/upload/publikationer/content0/1/forstamningssyndrom_fulltext.pdf
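
For anyone following along, here is a rough sketch of why a plain sort by 
position can scramble a two-column page. It is a toy model, not PDFBox or 
Tika code: the Fragment class, the sample coordinates, and the column split 
at x = 200 are all made up for illustration. Sorting every fragment top to 
bottom and then left to right interleaves the two columns line by line, 
while reading the left column fully before the right one keeps an index 
list in order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Toy model only: hypothetical names, not the PDFBox or Tika implementation.
final class ColumnOrderDemo {

    /** A piece of text together with the position where it is drawn on the page. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Naive position sort: top to bottom, then left to right across the whole page. */
    static List<String> sortByPosition(List<Fragment> fragments) {
        return fragments.stream()
                .sorted(Comparator.comparingDouble((Fragment f) -> f.y)
                        .thenComparingDouble(f -> f.x))
                .map(f -> f.text)
                .collect(Collectors.toList());
    }

    /** Column-aware order: fragments left of columnSplitX first, then the rest, each top to bottom. */
    static List<String> sortByColumn(List<Fragment> fragments, double columnSplitX) {
        return fragments.stream()
                .sorted(Comparator.comparing((Fragment f) -> f.x >= columnSplitX)
                        .thenComparingDouble(f -> f.y)
                        .thenComparingDouble(f -> f.x))
                .map(f -> f.text)
                .collect(Collectors.toList());
    }

    /** A made-up two-column index page whose right column is stored first. */
    static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment(300, 100, "gamma 30")); // right column, line 1
        page.add(new Fragment(300, 120, "delta 40")); // right column, line 2
        page.add(new Fragment(50, 100, "alpha 10"));  // left column, line 1
        page.add(new Fragment(50, 120, "beta 20"));   // left column, line 2
        return page;
    }

    public static void main(String[] args) {
        // prints [alpha 10, gamma 30, beta 20, delta 40]
        System.out.println(sortByPosition(samplePage()));
        // prints [alpha 10, beta 20, gamma 30, delta 40]
        System.out.println(sortByColumn(samplePage(), 200.0));
    }
}
```

In this sketch the column boundary is passed in by hand; a real extractor 
would have to detect it from the page layout, which is the hard part.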



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
