[
https://issues.apache.org/jira/browse/TIKA-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413601#comment-15413601
]
Tim Allison commented on TIKA-2052:
-----------------------------------
Yes, this is a general problem with PDFs. Try the troubleshooting
recommendations on our
[wiki|https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems].
> Words are separated where the letters are spaced together in the PDF
> document
> -----------------------------------------------------------------------------------
>
> Key: TIKA-2052
> URL: https://issues.apache.org/jira/browse/TIKA-2052
> Project: Tika
> Issue Type: Bug
> Reporter: Sebastian Landwehr
>
> For example in the following document:
> https://www.g-ba.de/downloads/39-261-2062/2014-08-21_QSKH-RL_Q-Report_2013.pdf
> Searching for "onsimpulse des Herzschrittmachers" finds the location where
> "Herzschrittmacher" is separated into "Herzschrittma chers". This is
> especially problematic when using the PDF for full-text search, because such
> trailing fragments are often matched even though they are not really part of
> the content. The whitespace config parameter did not help.