[
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090161#comment-13090161
]
Michael McCandless commented on TIKA-611:
-----------------------------------------
Here's the javadoc describing PDFBox's setSortByPosition:
http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html#setSortByPosition(boolean)
The javadoc actually makes "false" sound spooky, since it seems to mean the text
can come out jumbled; it's curious that "false" fixes this two-column PDF
case...
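A toy model may help explain why "false" can win for column layouts. The sketch below is not PDFBox code: Fragment, twoColumnPage and the other names are hypothetical, and it assumes that (a) sorting by position means top-to-bottom, then left-to-right, and (b) the PDF's content stream happens to write one column at a time. Under those assumptions, position sorting interleaves the two columns line by line, while stream order keeps each column intact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ColumnOrderDemo {

    /** Hypothetical stand-in for a positioned piece of text on a page. */
    public static final class Fragment {
        final String text;
        final int x; // distance from the left edge
        final int y; // distance from the top edge

        Fragment(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Assumed content-stream order: whole left column first, then the right one. */
    public static List<Fragment> twoColumnPage() {
        return Arrays.asList(
                new Fragment("L1", 0, 0),
                new Fragment("L2", 0, 10),
                new Fragment("R1", 100, 0),
                new Fragment("R2", 100, 10));
    }

    private static String join(List<Fragment> fragments) {
        List<String> texts = new ArrayList<>();
        for (Fragment f : fragments) {
            texts.add(f.text);
        }
        return String.join(" ", texts);
    }

    /** Models sortByPosition=false: emit text in the order it was written. */
    public static String inStreamOrder(List<Fragment> fragments) {
        return join(fragments);
    }

    /** Models sortByPosition=true as a naive top-to-bottom, left-to-right sort. */
    public static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.y)
                .thenComparingInt(f -> f.x));
        return join(sorted);
    }

    public static void main(String[] args) {
        List<Fragment> page = twoColumnPage();
        System.out.println(inStreamOrder(page));    // L1 L2 R1 R2
        System.out.println(sortedByPosition(page)); // L1 R1 L2 R2
    }
}
```

The flip side is that nothing forces a PDF producer to write text in reading order, which is presumably why the javadoc warns about "false": if the stream order were scrambled, the first method would return scrambled text and the positional sort would be the safer choice.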
Does anyone have a simple small PDF sample doc w/ columns that we could at
least add as test cases here, so we can catch changes in how such PDFs are
handled by Tika, going forward?
Separately, I agree we really need TIKA-612 here... the right setting seems to
be very usage-dependent.
> PDFParser mixes the text from separate columns
> ----------------------------------------------
>
> Key: TIKA-611
> URL: https://issues.apache.org/jira/browse/TIKA-611
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.0
>
>
> As reported on the dev list by Michael Schmitz :
> bq. I don't think the current snapshot is parsing articles (pdfs with
> columns/beads) correctly. The text is not in the right order as it
> intermixes text from different beads. Try it on an academic paper.
> http://turing.cs.washington.edu/papers/acl08.pdf
> This can be fixed by changing the value of setSortByPosition to false, which
> is the default value in PDFTextStripper. This line (PDF2XHTML:82) had been
> added as part of the commit rev 1029510, see
> https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787
> Ideally we could specify what value to set for these parameters via the
> Context object, but for the time being wouldn't it make sense to set
> setSortByPosition to the default value of false? I think that this would be
> the best option for most cases where docs have columns.
>