Hello,

Sorry to bring this up again, but https://issues.apache.org/jira/browse/TIKA-548 is a real issue for PDF content extraction. Try tika-app-0.8 on any PDF and you'll see words strangely concatenated in the output. This prevents, for example, text search on headlines in the extracted content, which worked nicely before.
I've tried to fix this myself, but the error is somewhere in the interaction between Tika and PDFBox. I'm sure someone with a better understanding of that interaction could fix it in no time, since it worked before PDFBox 1.3.1. I have attached a unit test to the issue.

/Staffan
