[jira] Commented: (TIKA-611) PDFParser mixes the text from separate columns

Chris A. Mattmann (JIRA) Tue, 08 Mar 2011 08:18:24 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004025#comment-13004025
 ]


Chris A. Mattmann commented on TIKA-611:
----------------------------------------

Well, I'm not sure setting setSortByPosition to false is the best thing in all 
cases: not all PDFs are academic research papers.

I'd be +1 for making this a ParseContext param, and allowing it to override the 
default value.
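
A minimal sketch of what that could look like, assuming the option travels in 
a small value object keyed by its class, the way ParseContext entries are 
looked up. SketchContext is a self-contained stand-in for ParseContext so the 
example runs without Tika on the classpath, and PdfTextOptions is a 
hypothetical name, not an existing Tika class. The default of false follows 
the proposal in the issue description and is an assumption here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Tika's ParseContext: maps a class to one instance of it.
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns the stored instance for the key, or the given default if none was set.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

// Hypothetical options holder; the name and shape are assumptions for illustration.
final class PdfTextOptions {
    // Assumed default: false, matching the PDFTextStripper default cited in the issue.
    static final PdfTextOptions DEFAULT = new PdfTextOptions(false);

    private final boolean sortByPosition;

    PdfTextOptions(boolean sortByPosition) {
        this.sortByPosition = sortByPosition;
    }

    boolean isSortByPosition() {
        return sortByPosition;
    }
}

final class SortByPositionSketch {
    // Resolves the flag a parser would then pass to PDFTextStripper.setSortByPosition(...).
    static boolean resolveSortByPosition(SketchContext context) {
        return context.get(PdfTextOptions.class, PdfTextOptions.DEFAULT).isSortByPosition();
    }

    public static void main(String[] args) {
        SketchContext context = new SketchContext();
        // No override set: the default applies.
        System.out.println(resolveSortByPosition(context));

        // Caller overrides the default for documents where position sorting helps.
        context.set(PdfTextOptions.class, new PdfTextOptions(true));
        System.out.println(resolveSortByPosition(context));
    }
}
```

With something along these lines, callers that parse multi-column papers keep 
the default, and callers with other kinds of PDFs can opt in per parse call 
without a change to the parser itself.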


> PDFParser mixes the text from separate columns
> ----------------------------------------------
>
>                 Key: TIKA-611
>                 URL: https://issues.apache.org/jira/browse/TIKA-611
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.0
>
>
> As reported on the dev list by Michael Schmitz:
> bq. I don't think the current snapshot is parsing articles (PDFs with 
> columns/beads) correctly.  The text is not in the right order as it 
> intermixes text from different beads.  Try it on an academic paper. 
> http://turing.cs.washington.edu/papers/acl08.pdf
> This can be fixed by changing the value of setSortByPosition to false, which 
> is the default value in PDFTextStripper. This line (PDF2XHTML:82) had been 
> added as part of the commit rev 1029510, see 
> https://issues.apache.org/jira/browse/TIKA-446?focusedCommentId=12926787&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12926787
> Ideally we could specify what value to set for these parameters via the 
> Context object, but for the time being wouldn't it make sense to set 
> setSortByPosition to the default value of false? I think that this would be 
> the best option for most cases where docs have columns.
>  

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
