[
https://issues.apache.org/jira/browse/TIKA-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Landwehr closed TIKA-2052.
------------------------------------
Resolution: Not A Problem
PDFBox issue ...
> Words are separated where the letters are spaced together in the PDF
> document
> -----------------------------------------------------------------------------------
>
> Key: TIKA-2052
> URL: https://issues.apache.org/jira/browse/TIKA-2052
> Project: Tika
> Issue Type: Bug
> Reporter: Sebastian Landwehr
>
> For example in the following document:
> https://www.g-ba.de/downloads/39-261-2062/2014-08-21_QSKH-RL_Q-Report_2013.pdf
> Searching for "onsimpulse des Herzschrittmachers" finds the location where
> "Herzschrittmachers" is separated into "Herzschrittma chers". This is
> especially problematic when using the PDF for full text search, because
> searches then often match such split-off end syllables, which are not
> really part of the content. The whitespace config parameter did not help.
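
The symptom described above can be worked around on the search side. What follows is only a minimal sketch, not a fix in Tika or PDFBox: it assumes the extracted text is already available as a plain string, and the function names and the sample string are hypothetical, modeled on the split word quoted in the report. It matches a query even when stray whitespace was inserted between its letters during extraction.

```python
import re


def spacing_tolerant_pattern(query: str) -> "re.Pattern[str]":
    """Build a regex that allows optional whitespace between the query's characters.

    Hypothetical helper: whitespace in the query itself is dropped, so the
    pattern tolerates both inserted and missing spaces.
    """
    chars = [re.escape(c) for c in query if not c.isspace()]
    return re.compile(r"\s*".join(chars), re.IGNORECASE)


def find_spacing_tolerant(text: str, query: str):
    """Return the (start, end) span of the first tolerant match, or None."""
    if not query.strip():
        return None  # an empty query would otherwise match everywhere
    match = spacing_tolerant_pattern(query).search(text)
    return match.span() if match else None


# Hypothetical extracted text containing the split reported in the issue.
sample = "impulse des Herzschrittma chers"

# A plain substring search misses the word because of the inserted space.
print("Herzschrittmachers" in sample)

# The tolerant search still locates it, space included.
span = find_spacing_tolerant(sample, "Herzschrittmachers")
print(span, repr(sample[span[0]:span[1]]))
```

The trade-off is precision: because whitespace is optional everywhere, the pattern can also match across genuine word boundaries, so it suits a fallback lookup better than a primary index.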