the PDF content regression

Staffan Wed, 17 Nov 2010 23:32:36 -0800

Hello,

Sorry to bring this up again, but
https://issues.apache.org/jira/browse/TIKA-548 is a real issue for PDF
content extraction. Try tika-app-0.8 on any PDF and you'll see words
strangely concatenated in the output. This prevents, for example, text
search on headlines in the extracted content, which worked nicely
before.


I've tried to fix this myself, but the error is somewhere in the
interaction between Tika and PDFBox. I'm sure someone with a better
understanding of that interaction could fix it in no time, since it
worked before PDFBox 1.3.1. I have attached a unit test to the issue.

/Staffan
