[jira] [Commented] (TIKA-611) PDFParser mixes the text from separate columns

Michael McCandless (JIRA) Wed, 24 Aug 2011 05:20:05 -0700

    [ https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090161#comment-13090161 ]


Michael McCandless commented on TIKA-611:
-----------------------------------------

Here's the javadoc describing PDFBox's setSortByPosition:

  http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html#setSortByPosition(boolean)

The javadoc actually makes "false" sound spooky, since it seems to mean the 
text can come out jumbled; it's curious that "false" is what fixes this 
two-column PDF case...
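
One possible explanation, sketched as a toy model below: if the PDF's content 
stream already holds the left column followed by the right column, then 
sorting fragments purely by page position interleaves the two columns line by 
line. This is only an illustration under that assumption. The Chunk class, the 
coordinates, and the sort key (top-to-bottom, then left-to-right) are all 
hypothetical and are not PDFBox's actual algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ColumnOrderDemo {

    /** Hypothetical text fragment: its content plus page coordinates (y grows downward). */
    static final class Chunk {
        final String text;
        final int x;
        final int y;

        Chunk(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Fragments as written in the (assumed) content stream: left column, then right column. */
    static List<Chunk> contentStream() {
        return Arrays.asList(
            new Chunk("L1", 0, 0),
            new Chunk("L2", 0, 10),
            new Chunk("R1", 100, 0),
            new Chunk("R2", 100, 10));
    }

    /** Joins the fragment texts with single spaces. */
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Analogue of "sort by position off": keep the stream order. */
    static String streamOrder() {
        return join(contentStream());
    }

    /** Analogue of "sort by position on": naive sort by row, then by x within a row. */
    static String positionOrder() {
        List<Chunk> sorted = new ArrayList<>(contentStream());
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.y).thenComparingInt(c -> c.x));
        return join(sorted);
    }

    public static void main(String[] args) {
        System.out.println(streamOrder());   // prints "L1 L2 R1 R2"
        System.out.println(positionOrder()); // prints "L1 R1 L2 R2"
    }
}
```

A PDF whose stream is not in reading order would show the opposite effect, 
which may be why the javadoc reads the way it does.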

Does anyone have a small, simple sample PDF with columns that we could at 
least add as a test case here, so we can catch changes in how Tika handles 
such PDFs going forward?
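
As a rough idea of what such a test could assert, here is a small helper that 
checks whether marker phrases show up in the extracted text in the expected 
reading order (all of column one, then all of column two). The class name, 
method name, and marker strings are made up for illustration; the extracted 
text would come from whatever parser call the real test uses.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnOrderCheck {

    /**
     * Returns true if every marker occurs in the text and each one starts
     * after the end of the previous marker's match.
     */
    static boolean appearInOrder(String text, List<String> markers) {
        int from = 0;
        for (String marker : markers) {
            int at = text.indexOf(marker, from);
            if (at < 0) {
                return false; // missing, or only present before the previous marker
            }
            from = at + marker.length();
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical markers: two phrases from the left column, then two from the right.
        List<String> expected = Arrays.asList("Left one", "Left two", "Right one", "Right two");

        String columnsKept = "Left one. Left two. Right one. Right two.";
        String columnsMixed = "Left one. Right one. Left two. Right two.";

        System.out.println(appearInOrder(columnsKept, expected));  // prints "true"
        System.out.println(appearInOrder(columnsMixed, expected)); // prints "false"
    }
}
```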

Separately, I agree we really need TIKA-612 here... the right setting seems 
to be very usage-dependent.

> PDFParser mixes the text from separate columns
> ----------------------------------------------
>
>                 Key: TIKA-611
>                 URL: https://issues.apache.org/jira/browse/TIKA-611
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.0
>
>
> As reported on the dev list by Michael Schmitz:
> bq. I don't think the current snapshot is parsing articles (PDFs with 
> columns/beads) correctly.  The text is not in the right order as it 
> intermixes text from different beads.  Try it on an academic paper. 
> http://turing.cs.washington.edu/papers/acl08.pdf
> This can be fixed by changing the value of setSortByPosition to false, which 
> is the default value in PDFTextStripper. This line (PDF2XHTML:82) had been 
> added as part of the commit rev 1029510, see 
> https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787
> Ideally we could specify what value to set for these parameters via the 
> Context object, but for the time being wouldn't it make sense to set 
> setSortByPosition to the default value of false? I think that this would be 
> the best option for most cases where docs have columns.
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
