2010/12/10 Michael Schmitz <[email protected]>:
> Hi,
>
> I don't think the current snapshot is parsing articles (PDFs with
> columns/beads) correctly. The text is not in the right order, as it
> intermixes text from different beads. Try it on an academic paper.
>
> http://turing.cs.washington.edu/papers/acl08.pdf
>
> Tika App 0.8 parses the text in the right order but omits spaces. PDFBox
> 1.3.1 parses the file wonderfully. I attached a parsing of the PDF using
> each utility.
>
> Peace. Michael

Could be related to https://issues.apache.org/jira/browse/TIKA-548
