[jira] [Closed] (TIKA-3307) extracted text strings have repeated characters

Tilman Hausherr (Jira) Thu, 11 Mar 2021 10:27:05 -0800


     [ 
https://issues.apache.org/jira/browse/TIKA-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tilman Hausherr closed TIKA-3307.
---------------------------------
    Resolution: Not A Bug

> extracted text strings have repeated characters
> -----------------------------------------------
>
>                 Key: TIKA-3307
>                 URL: https://issues.apache.org/jira/browse/TIKA-3307
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Paul Tyson
>            Priority: Major
>         Attachments: WSHP-PRC025F-EN_07132019.pdf
>
>
> Extracted text from some PDF files includes some strings with repeated 
> (doubled) characters.
> To reproduce the problem, download the attached PDF file and run the 
> following command:
> {code:java}
> java -jar ./tika-app-1.25.jar -T WSHP-PRC025F-EN_07132019.pdf | egrep '(.)\1(.)\2'
> {code}
> The affected strings all appear to be headings, so the font or another 
> style feature may be causing the problem.
> First detected in version 1.19 and reproduced with 1.25. Earlier versions 
> were not tested.
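
The egrep pattern in the reproduction command matches any line that contains
two doubled characters in a row. A minimal Python sketch of the same check is
given here for readers who want to scan already-extracted text without a
shell. The function name and the sample strings are illustrative assumptions
and do not come from the issue or from Tika.

```python
import re

# Same pattern as the egrep command: a character repeated once,
# immediately followed by another character repeated once.
DOUBLED_PAIR = re.compile(r"(.)\1(.)\2")


def lines_with_doubled_pairs(text):
    """Return the lines of `text` that contain two doubled characters in a row."""
    return [line for line in text.splitlines() if DOUBLED_PAIR.search(line)]


# Hypothetical sample text, not taken from the attached PDF.
sample = "IInnttrroodduuccttiioonn\nNormal body text\nbookkeeper"
flagged = lines_with_doubled_pairs(sample)
print(flagged)
```

The pattern is only a heuristic: it also flags ordinary words such as
"bookkeeper" and runs of four identical characters, so matches still need a
manual check against the source PDF.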



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
