[ 
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004035#comment-13004035
 ] 

Julien Nioche commented on TIKA-611:
------------------------------------

The current behaviour is incorrect not only for academic research papers but 
for any document that uses columns (e.g. contracts). Leaving setSortByPosition 
at false was how things were done prior to the modification in TIKA-446.
Can we fix the boolean value in this issue, then open a new issue to implement 
the mechanism with ParseContext for this and the other params?
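
As a rough sketch of what the ParseContext mechanism could look like: the 
caller registers an options object under its class, and the parser falls back 
to a default when nothing was supplied. Everything below is an assumption made 
for illustration. PdfTextOptions is a hypothetical class, and SketchContext is 
a minimal stand-in for a type-keyed context, not Tika's actual ParseContext.

```java
import java.util.HashMap;
import java.util.Map;

class ContextConfigSketch {

    // Hypothetical options holder; not an existing Tika class.
    static final class PdfTextOptions {
        final boolean sortByPosition;

        PdfTextOptions(boolean sortByPosition) {
            this.sortByPosition = sortByPosition;
        }
    }

    // Minimal type-keyed context, used here as a stand-in for a parse context.
    static final class SketchContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    // Parser side: use the caller's options if present, otherwise default to
    // false, the PDFTextStripper default mentioned in the issue description.
    static boolean resolveSortByPosition(SketchContext context) {
        PdfTextOptions defaults = new PdfTextOptions(false);
        return context.get(PdfTextOptions.class, defaults).sortByPosition;
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        System.out.println(resolveSortByPosition(context)); // no options set

        context.set(PdfTextOptions.class, new PdfTextOptions(true));
        System.out.println(resolveSortByPosition(context)); // caller override
    }
}
```

With this shape the default stays safe for documents with columns, and a 
caller who does want position sorting can opt in without a code change.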

> PDFParser mixes the text from separate columns
> ----------------------------------------------
>
>                 Key: TIKA-611
>                 URL: https://issues.apache.org/jira/browse/TIKA-611
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.0
>
>
> As reported on the dev list by  Michael Schmitz :
> bq. I don't think the current snapshot is parsing articles (pdfs with 
> columns/beads) correctly.  The text is not in the right order as it 
> intermixes text from different beads.  Try it on an academic paper. 
> http://turing.cs.washington.edu/papers/acl08.pdf
> This can be fixed by changing the value of setSortByPosition to false, which 
> is the default value in PDFTextStripper. This line (PDF2XHTML:82) had been 
> added as part of the commit rev 1029510, see 
> https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787
> Ideally we could specify what value to set for these parameters via the 
> Context object, but for the time being wouldn't it make sense to set 
> setSortByPosition to the default value of false? I think that this would be 
> the best option for most cases where docs have columns.
>  
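
To illustrate why sorting by position can mix columns, here is a simplified 
model, not PDFBox code. It assumes a two-column page whose content stream 
emits the left column first, and it approximates position sorting as "top to 
bottom, then left to right". The Fragment class and the coordinates are made 
up for the example; the real PDFTextStripper logic is more involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnOrderDemo {

    // Hypothetical positioned text fragment; not a PDFBox class.
    static final class Fragment {
        final String text;
        final int x; // distance from the left edge of the page
        final int y; // distance from the top of the page

        Fragment(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Fragments in the order the (assumed) content stream emits them:
    // the whole left column, then the whole right column.
    static List<Fragment> streamOrder() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("L1", 50, 100));
        fragments.add(new Fragment("L2", 50, 120));
        fragments.add(new Fragment("R1", 300, 100));
        fragments.add(new Fragment("R2", 300, 120));
        return fragments;
    }

    // Simplified position sort: by vertical position, then horizontal.
    static List<Fragment> sortedByPosition(List<Fragment> input) {
        List<Fragment> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.y)
                .thenComparingInt(f -> f.x));
        return sorted;
    }

    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> page = streamOrder();
        System.out.println("stream order:       " + join(page));
        System.out.println("sorted by position: " + join(sortedByPosition(page)));
    }
}
```

In this model the stream order reads "L1 L2 R1 R2", while the position sort 
yields "L1 R1 L2 R2", i.e. lines from the two columns are interleaved.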

