[
https://issues.apache.org/jira/browse/TIKA-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413701#comment-15413701
]
Tim Allison commented on TIKA-2052:
-----------------------------------
Sorry. I suspect this is an issue with the PDF itself rather than with PDFBox,
but if they can fix it, great! Thank you for opening this.
> Words are separated where the letters are spaced together in the PDF
> document
> -----------------------------------------------------------------------------------
>
> Key: TIKA-2052
> URL: https://issues.apache.org/jira/browse/TIKA-2052
> Project: Tika
> Issue Type: Bug
> Reporter: Sebastian Landwehr
>
> For example in the following document:
> https://www.g-ba.de/downloads/39-261-2062/2014-08-21_QSKH-RL_Q-Report_2013.pdf
> Searching for "onsimpulse des Herzschrittmachers" leads to the location where
> "Herzschrittmachers" is extracted as "Herzschrittma chers". This is
> especially problematic when the extracted PDF text is used for full-text
> search, because searches then often match stray word endings that are not
> real words in the content. The whitespace config parameter did not help.
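
To illustrate the effect described above, here is a minimal sketch in plain Java. It does not use Tika or PDFBox. The class and method names are hypothetical, and the extracted string is only the fragment quoted in the report. It shows why an exact search misses the split word, and one possible search-side workaround (treating whitespace between characters as optional), which is an assumption for illustration and not something Tika or PDFBox does.

```java
import java.util.regex.Pattern;

public class SplitWordSearch {

    // Build a regex that matches the query even when whitespace has been
    // inserted between any two of its characters. Whitespace inside the
    // query itself is skipped, so it becomes optional as well.
    static Pattern whitespaceTolerant(String query) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append("\\s*");
            }
            sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Exact substring search, as a naive full-text lookup would do.
    static boolean plainContains(String text, String query) {
        return text.contains(query);
    }

    // Search that ignores whitespace differences between text and query.
    static boolean tolerantContains(String text, String query) {
        return whitespaceTolerant(query).matcher(text).find();
    }

    public static void main(String[] args) {
        // Fragment as reported in the issue, with the spurious space.
        String extracted = "onsimpulse des Herzschrittma chers";
        String query = "Herzschrittmachers";

        System.out.println("plain:    " + plainContains(extracted, query));
        System.out.println("tolerant: " + tolerantContains(extracted, query));
    }
}
```

The tolerant match can also produce false positives, because it will join characters across genuine word boundaries. It only works around the symptom on the search side and leaves the extracted text unchanged.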