2010/12/10 Michael Schmitz <[email protected]>:
> Hi,
>
> I don't think the current snapshot is parsing articles (PDFs with
> columns/beads) correctly.  The text is not in the right order, as it
> intermixes text from different beads.  Try it on an academic paper.
>
> http://turing.cs.washington.edu/papers/acl08.pdf
>
> Tika App 0.8 parses the text in the right order but omits spaces.  PDFBox
> 1.3.1 parses the file wonderfully.  I attached the output of parsing the
> PDF with each utility.
>
> Peace.  Michael
>
>

Could be related to https://issues.apache.org/jira/browse/TIKA-548
