PDFParser mixes the text from separate columns
----------------------------------------------

                 Key: TIKA-611
                 URL: https://issues.apache.org/jira/browse/TIKA-611
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
            Reporter: Julien Nioche
            Assignee: Julien Nioche
             Fix For: 1.0


As reported on the dev list by  Michael Schmitz :

bq. I don't think the current snapshot is parsing articles (pdfs with 
columns/beads) correctly.  The text is not in the write order as it intermixes 
text from different beads.  Try it on an academic paper. 
http://turing.cs.washington.edu/papers/acl08.pdf

This can be fixed by changing the value of setSortByPosition to false, which is 
the default value in PDFTextStripper. This line (PDF2XHTML:82) had been added 
as part of the commit rev 1029510, see 
https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787

Ideally we could specify what value to set for these parameters via the Context 
object, but for the time being wouldn't it make sense to set setSortByPosition 
to the default value of false? I think that this would be the best option for 
most cases where docs have columns.

 




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to