[
https://issues.apache.org/jira/browse/TIKA-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tilman Hausherr closed TIKA-3307.
---------------------------------
Resolution: Not A Bug
> extracted text strings have repeated characters
> -----------------------------------------------
>
> Key: TIKA-3307
> URL: https://issues.apache.org/jira/browse/TIKA-3307
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Paul Tyson
> Priority: Major
> Attachments: WSHP-PRC025F-EN_07132019.pdf
>
>
> Extracted text from some PDF files includes strings with repeated
> (doubled) characters.
> To reproduce the problem, download the attached PDF file and run the
> following command:
> {code:java}
> java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
> {code}
> The bad strings all seem to be headings, so perhaps something in the font or
> other style features is causing the problem.
> First detected in version 1.19, retested with 1.25. Did not test earlier
> versions.
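
For anyone checking other extracted files, the egrep filter in the report can be sketched in Python as below. This is only a rough illustration, not part of Tika: the function name `doubled_lines` and the sample strings are made up for the example, and the pattern is the one from the report, which matches two adjacent doubled pairs such as "IInn".

```python
import re

# Same pattern as the egrep call in the report: any character repeated once,
# immediately followed by another character repeated once (e.g. "IInn").
DOUBLED = re.compile(r"(.)\1(.)\2")


def doubled_lines(lines):
    """Return the lines that contain two adjacent doubled characters.

    Hypothetical helper for illustration; it only filters text that has
    already been extracted and does not call Tika itself.
    """
    return [line for line in lines if DOUBLED.search(line)]


if __name__ == "__main__":
    # Made-up sample strings, not taken from the attached PDF.
    sample = ["IInnttrroodduuccttiioonn", "Introduction", "balloon"]
    for line in doubled_lines(sample):
        print(line)
```

The pattern also flags ordinary words such as "balloon" or "coffee" (and runs of four identical characters, including spaces), so hits need a manual look before being counted as extraction problems.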