[
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004025#comment-13004025
]
Chris A. Mattmann commented on TIKA-611:
----------------------------------------
Well, I'm not sure it's the best thing in all cases: not all PDFs are academic
research papers.
I'd be +1 for making this a ParseContext param, and allowing it to override the
default value.
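
As a rough sketch of what such a ParseContext param could look like: the
parser looks up an options object in the context and falls back to a default
when the caller has not supplied one. Everything below is an assumption made
for illustration. SketchContext is a small stand-in for Tika's ParseContext so
the example runs without Tika on the classpath, and PdfTextOptions and
SortByPositionLookup are hypothetical names, not existing Tika classes. The
default of false follows the proposal in the issue; which default is right is
exactly what is under discussion.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a class-keyed context object (hypothetical, not Tika's class).
class SketchContext {
    private final Map<Class<?>, Object> values = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        values.put(key, value);
    }

    // Returns the stored value for the key, or the default if none was set.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = values.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

// Hypothetical options holder carrying the sortByPosition setting.
class PdfTextOptions {
    private final boolean sortByPosition;

    PdfTextOptions(boolean sortByPosition) {
        this.sortByPosition = sortByPosition;
    }

    boolean isSortByPosition() {
        return sortByPosition;
    }
}

class SortByPositionLookup {
    // Assumed parser-side default: false, as proposed in the issue.
    static final PdfTextOptions DEFAULTS = new PdfTextOptions(false);

    // What the parser would do before configuring its text stripper.
    static boolean resolve(SketchContext context) {
        return context.get(PdfTextOptions.class, DEFAULTS).isSortByPosition();
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        System.out.println(resolve(context)); // default applies: false

        // A caller that wants position-sorted output overrides the default.
        context.set(PdfTextOptions.class, new PdfTextOptions(true));
        System.out.println(resolve(context)); // override applies: true
    }
}
```

With this shape, callers who never touch the context get the default, and
only those who care about the ordering need to set anything.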
> PDFParser mixes the text from separate columns
> ----------------------------------------------
>
> Key: TIKA-611
> URL: https://issues.apache.org/jira/browse/TIKA-611
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.0
>
>
> As reported on the dev list by Michael Schmitz:
> bq. I don't think the current snapshot is parsing articles (pdfs with
> columns/beads) correctly. The text is not in the right order as it
> intermixes text from different beads. Try it on an academic paper.
> http://turing.cs.washington.edu/papers/acl08.pdf
> This can be fixed by calling setSortByPosition(false), which matches the
> default behaviour of PDFTextStripper. The line that sets it to true
> (PDF2XHTML:82) was added as part of the commit rev 1029510, see
> https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787
> Ideally we could specify the values of these parameters via the
> Context object, but for the time being wouldn't it make sense to leave
> sortByPosition at its default value of false? I think that this would be
> the best option for most cases where docs have columns.
>
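
To illustrate the column mixing described in the issue, here is a deliberately
simplified model, not PDFBox's actual algorithm. It assumes text chunks arrive
in stream order (left column first, then right column) and that sorting by
position means ordering by y, then by x, with y growing down the page. The
Chunk and ColumnOrderDemo names and the coordinates are made up for the
example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A piece of text with its position on the page (y grows downward here).
class Chunk {
    final String text;
    final int x;
    final int y;

    Chunk(String text, int x, int y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class ColumnOrderDemo {
    // Two-column page: the left column is written first, then the right one.
    static List<Chunk> streamOrder() {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("L1", 50, 100));
        chunks.add(new Chunk("L2", 50, 120));
        chunks.add(new Chunk("R1", 300, 100));
        chunks.add(new Chunk("R2", 300, 120));
        return chunks;
    }

    // Joins chunk text, optionally reordering by position (y first, then x).
    static String extract(List<Chunk> chunks, boolean sortByPosition) {
        List<Chunk> ordered = new ArrayList<>(chunks);
        if (sortByPosition) {
            ordered.sort(Comparator.comparingInt((Chunk c) -> c.y)
                    .thenComparingInt(c -> c.x));
        }
        StringBuilder out = new StringBuilder();
        for (Chunk c : ordered) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(c.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(streamOrder(), false)); // L1 L2 R1 R2
        System.out.println(extract(streamOrder(), true));  // L1 R1 L2 R2
    }
}
```

In this model, sorting by position reads straight across both columns, which
is the interleaving reported above. Stream order keeps each column together,
but only because the document happened to write its columns one after the
other, so neither setting is right for every PDF.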